Polymatheia

App Support

Sherry Towers — Thu, 02 Oct 2025 19:51:27 +0000

Support for this app can be obtained by contacting the app author at TowersConsultingLLC@gmail.com

Protected: Pinpointing areas at risk of potential partisan violence, summer 2022

Sherry Towers — Tue, 10 May 2022 02:10:37 +0000

Protected: Online application to facilitate classification of US protests since 2020

Sherry Towers — Sun, 08 May 2022 01:54:20 +0000

Protected: Hotspot analysis of extremist violence in the PNW

Sherry Towers — Thu, 14 Apr 2022 22:46:57 +0000

Protected: REgional Violence and Extremism Analytic Lens, Pacific Northwest (REVEAL:PNW)

Sherry Towers — Mon, 04 Apr 2022 16:22:32 +0000

Trying to lose those pandemic pounds? Here are the easiest diets to stick to

Sherry Towers — Sat, 19 Dec 2020 15:06:03 +0000

As we finally head towards the end of 2020, people are thinking ahead to a new year that is hopefully full of positive changes. For many people, the stresses of 2020 led to weight gain (not to mention that people working from home suddenly discovered the downside to having access to their fridges 24/7). But even before 2020, overweight and obesity had been rampant in most developed countries.

Every New Year, millions of people resolve to diet and hit the gym, but despite these periodic efforts, obesity rates have steadily climbed over the past decades. Clearly it is hard to stick to a diet, and there is the ever-constant search for the “best diet”. In this post, I’ll describe our recent paper that examined which diets are easiest to stick to by looking at trends in Internet searches for diet recipes.

If you want the TL;DR summary: the Paleo diet appeared to be easiest to stick to throughout the year, with Weight Watchers and Low Carb being close seconds… during the pandemic lockdown, the Paleo diet appeared easiest to stick to (more people actually started it than dropped out, in fact!), and second place prize went to Low Carb. Weight Watchers fared particularly poorly during the pandemic, perhaps because in-person support groups that some WW dieters attend were curtailed.

Knowing which diets are easiest to stick to could potentially help people to have better success at losing weight, but the problem is that clinical diet studies are very hard to do. Researchers have to recruit participants and follow them for weeks, months, or even years, and the dropout rate of participants is very high, so in many cases the researchers don’t know why the participants dropped out or what happened to their weight afterwards. And because of the intense monitoring associated with the studies, the results tend not to be reflective of people “in the wild” who decide to diet on their own without any medical monitoring. It is unclear whether wild dieters find it easier or harder to stick to a diet compared to clinical dieters. Also, because of the difficulties in doing clinical studies, it is almost unheard of to compare two different diets side-by-side in the same study to see how well people lost weight and were able to maintain the diet.

In 2019, a group of collaborators and I took a data science look at which diets appear easiest for people to stick to, and published the results in our paper “How long do people stick to a diet resolution? A digital epidemiological estimation of weight loss diet persistence”. Using US Internet search data for diet recipes, we looked at a variety of popular diets, including Weight Watchers, South Beach, Paleo, Low Carb, and Low Fat diets, and examined how long new January dieters appeared to stick to them (sadly, the Keto diet hadn’t been around long enough by then to give us enough data to include in the study).

Here is what several years of Google search trend data look like for various diets (note that some diets have a longer span of data than others):

There is a sharp peak in searches for all diet recipes every January, and by using a mathematical model to examine how quickly the recipe searches dropped off our team was able to estimate how long people appeared to stick to each kind of diet (assuming of course that some cross-section of people on those diets at any one time are searching for diet recipes). In addition to the large spikes in new dieters each January, the Internet search data also show a distinct dip in the number of dieters during the US winter holiday season (from late November to the end of December… you can see that in the plot above there is a dip in diet recipe searches for a month or two right before every January peak). Clearly for many people the lure of yummy holiday food is too much during that period. Assessing how easy a diet is to stick to is thus not just about New Year’s resolution diets, but also how easy it is for longer-term dieters to stick to a plan even during difficult times of year when we’re awash in delectable food.
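For readers who want to see the idea in code: one simple way to turn "how quickly the searches drop off" into a persistence time is to assume that the excess searches above the usual level decay exponentially after the January peak, and then fit the decay constant. This is only an illustrative sketch under that assumption, not the model from our paper; the function name and the numbers below are made up.

```python
import math

def fit_decay_time(excess):
    """Estimate tau (in sampling intervals, e.g. weeks) assuming
    excess[t] = A * exp(-t / tau), via a least-squares line through log(excess)."""
    # Keep only positive values, since log() is undefined otherwise.
    points = [(t, math.log(y)) for t, y in enumerate(excess) if y > 0]
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_log = sum(ly for _, ly in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (ly - mean_log) for t, ly in points)
    slope = sxy / sxx          # slope of log(excess) versus time
    return -1.0 / slope        # tau = -1 / slope

# Synthetic (made-up) weekly excess searches that decay with tau = 5 weeks.
synthetic = [100.0 * math.exp(-week / 5.0) for week in range(12)]
tau = fit_decay_time(synthetic)
print(f"estimated persistence time: {tau:.2f} weeks")
```

Real search data are noisy and include the holiday dip, so a fit like this would only be a rough starting point.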

Assessing which diets perform best

With our mathematical model that we fit to the trend data (see our paper for details), we estimated how long new January dieters appeared to stick to their diets before ceasing to search for recipes. We found that the easiest diets to stick to appeared to be Weight Watchers, Paleo, and Low Carb, all of which had compliance times of around 5 to 6 weeks. The worst were Low Fat and South Beach, with only 4-week and 3-week compliance, respectively. As far as December dieting goes, Paleo had by far the fewest people dropping out (with only around 15% of dieters dropping out), and the rest had between 20% and around 40% dropping out. Here is a chart summarising the results:

Characteristics of the best performing diets

What are the characteristics of our best performing diets? The Paleo diet allows a wide range of unprocessed foods with a focus on foods that hunter gatherers would have found familiar. The Paleo diet thus excludes grains, legumes, dairy, most alcohol (although some people feel red wine is OK), and refined sugars. Meat, eggs, veggies, fruits and honey are all allowed on the diet, so it doesn’t restrict carbs or fat. In general, the diet encourages a fairly wide range of food, which is perhaps what makes it easier to stick to. Studies have also shown that the Paleo diet is better for weight loss compared to some other popular diets.

Similarly, the Weight Watchers diet also encourages a wide range of foods, with a focus on moderation in quantities and encouraging healthy choices over processed food. The Low Carb diet doesn’t encourage as wide a range of food; however, some people find Low Carb diets fairly easy to stick to, at least for limited periods. As we see in our data though, the times of the year when there are lots of cakes and cookies can be the downfall of Low Carb dieters.

While using this data mining approach meant that we could compare diets side-by-side, it did have the drawback that we didn’t know how much people lost on each of the diets, and we don’t know their reasons for ceasing to search for recipes (was it really because they had stopped the diet?). We also didn’t know if people switched diets.

Which diet was easiest to stick to during the pandemic lockdown?

We now have over a full year more of recipe search data since we wrote our paper. We might wonder… did people find it harder to stick to their diets during the pandemic lockdowns in various US states that occurred between March and April of 2020? And, if so, which diets had the worst dropout rates?

To examine this, I did what is called “trend-correcting” the data. Diets wax and wane in popularity over the years, and first I had to divide out the long term trends. Here is what the data for 2020 look like when I do that, overlaid on the trend-corrected data from 2019:

It’s pretty clear that dieters (except those on the Paleo diet!) seemed to have a lot more trouble sticking to their diets during March and April of 2020 when social distancing from their refrigerator became difficult. Then for most diets there was a rebound effect later in 2020 where more people than usual appeared to be dieting (likely in a bid to lose their pandemic lockdown pounds).
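The trend-correcting step can be sketched as follows, assuming the long-term trend is approximated by a centred moving average about a year wide. The moving average, the function name, and the example numbers are all assumptions made for illustration, not a description of exactly how the plot above was produced.

```python
def trend_correct(series, window=52):
    """Divide each value by a centred moving average of up to `window` points.

    Assumes all values are positive (true for search-interest indices that
    never hit zero). Windows are simply truncated at the ends of the series.
    """
    half = window // 2
    corrected = []
    for i, value in enumerate(series):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        trend = sum(series[lo:hi]) / (hi - lo)   # local long-term level
        corrected.append(value / trend)          # values near 1.0 mean "on trend"
    return corrected

# Made-up weekly search index: a steady level with one spike.
weekly = [10.0] * 10 + [30.0] + [10.0] * 10
relative = trend_correct(weekly, window=8)
print(max(relative))  # the spike stands out well above 1.0
```

After this step, two different years can be overlaid on the same scale even if the diet's overall popularity changed between them.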

If we look at the percentage difference between the pink 2020 curve and the prediction from the year before, here is what we get:

The Paleo diet actually gained followers during the pandemic, with up to 50% more people following that diet than we would have expected compared to the year before! The next easiest diet for people to adhere to appeared to be the Low Carb diet, which only lost around 20% more dieters than usual during the lockdown. The hardest diets for people to follow during the lockdown were Weight Watchers and South Beach. For Weight Watchers, part of the issue might have been that some people following it choose to attend in-person support meetings, and during the lockdown these were likely sharply curtailed. This also likely explains why it really didn’t rebound as much as other diets after the lockdown… people were still avoiding group settings.
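The percentage difference used here is just the observed value minus the expected value, divided by the expected value. A minimal sketch, with a hypothetical function name and made-up numbers rather than the actual search data:

```python
def percent_difference(observed, expected):
    """Percentage by which each observed value differs from its expected value."""
    return [100.0 * (o - e) / e for o, e in zip(observed, expected)]

# Made-up trend-corrected values for two weeks of 2020 versus the prior-year prediction.
observed_2020 = [1.5, 0.75]
expected_from_2019 = [1.0, 1.0]
diffs = percent_difference(observed_2020, expected_from_2019)
print(diffs)  # positive means more searches than expected, negative means fewer
```

A positive value corresponds to a diet gaining followers relative to expectation, and a negative value to a diet losing them.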

Summary

For new January dieters, holiday dieters, and lockdown dieters it was Paleo diet for the win!

Second place: Low Carb.

My PRK eye surgery experience

Sherry Towers — Thu, 18 Jun 2020 12:04:24 +0000

In this post I’ll describe my PRK eye surgery experience and recovery. I decided to post my story because leading up to my surgery I read many rosy stories online of fairly quick recovery timelines (i.e., people being able to drive a day or a few after surgery), and my recovery definitely does not fit that mould. It doesn’t mean that I am unhappy with my decision to get surgery, but I think it’s good to be aware that PRK can potentially come with a fairly extended recovery period, and people considering it should conservatively book at least a week off of work, and expect their vision to be highly variable for several weeks.

In June, 2020, I had photorefractive keratectomy (PRK) laser surgery to correct my very poor vision (-5.5 in one eye, and -5.25 in the other with significant astigmatism in both… this put me in the lowest 10th percentile for vision in my age group. I also apparently had an unusually high amount of spherical aberration in both eyes, something glasses cannot correct for, but laser surgery can). I had been thinking about having laser surgery for years, but finally decided to take the leap this year because my work-related travel has recently been much lighter than it has been in many years (i.e., non-existent during the pandemic) and there would be very few conflicts with my work schedule. I have several family members who have had laser eye surgery, and they have loved the results.

I discussed my options with my optometrist, and he suggested a couple of excellent surgery centres in a nearby city. One was a chain outfit (TLC) with centres across Canada and the US that uses a data-science approach to optimise outcomes, not only overall across the country but also tuned for the local climatic conditions (humidity and air pressure can affect the laser performance, so they tune the laser in each individual city to optimise overall outcomes in that city). As a data scientist, I found this approach appealing.

The other centre he suggested had been one of the first lasik centres and had the longest experience in doing lasik of any centre in my area of the country.

I made appointments with both, but the chain outfit was the first I saw, and I ended up feeling so comfortable going with them that I cancelled the appointment with the other centre. In my consultation with the chain centre, they did a thorough eye exam, whereupon they told me I was a poor candidate for lasik because my corneas were too thin, but I was an excellent candidate for a procedure called PRK. I initially felt really disappointed, because I had my heart set on lasik, but upon reading more about PRK I learned that the outcomes are as good as (if not even better than) lasik, and it has a better safety outcome because there is no corneal flap to tear, and the risk of chronic dry eye is much less. But (and this has turned out to be a big but) it has a longer recovery time. Lasik surgery requires cutting a tiny flap in the cornea, whereas PRK involves literally burning off the entire top surface of your eye to reshape it.

I read up online about other people’s PRK experiences, and their recovery stories didn’t seem bad at all (most talked about driving within one to a few days after surgery), so I booked my appointment for PRK surgery a couple of weeks later. Because I work so much on a computer, I decided to get what is known as “monovision”, where one eye is tuned for close/medium distance vision, and the other eye is tuned for far distance vision. If I had gone with both eyes being tuned for far distance vision, I would need glasses for working on the computer, and given that most of my time is spent working on the computer, that would make getting laser surgery to avoid glasses somewhat pointless. With monovision, your brain seamlessly switches between inputs from the eye most suited to the current distance you’re looking at. Or at least that’s the idea… some people apparently cannot get used to it.

Here’s my surgery and recovery timeline story…

Day -14 (two weeks before surgery)

I started reading everything I could about PRK, including studies in the academic literature. It turns out that taking Omega-3 supplements can significantly improve healing time. I bought some, and began taking a double dose daily leading up to the surgery, aiming to continue for at least three months after the surgery.

Vitamin C has also been found to be beneficial for avoiding PRK complications. I’ve been taking 1000mg of vitamin C per day for years though, so I didn’t feel it was necessary to take any more than that.

Day -2 (two days before surgery)

My surgeon prescribed three kinds of drops that were to be started two days before surgery, and be used until four days after. Prolensa, to be used once a day, and gatifloxacin and Durezol to be used three times a day. Two of these (the Durezol and the Prolensa) were expensive.
The Durezol drops felt like I was being stabbed in the eyeball. 0/10 do not recommend.

I also bought a pack of preservative-free methylcellulose eye drops (they come in a big box full of little individual tubes that you break off the end to use). I bought a box of 70 thinking that would be more than enough (but spoiler alert: you’ll need a lot more than that during the recovery period).

Day -1 (one day before surgery)

In the surgical instruction packet, there is the comment that patients should avoid wearing hair spray, perfume, lotions, or any kind of scented product. It didn’t say why. I did some research, and it turns out the reason is that volatile organic compounds (VOCs) in the air can interfere with the laser. In fact, lasers have been shown to be very sensitive VOC detectors. VOCs are what make perfumes, lotions, etc. smell.

The day before the surgery I bought some unscented soap to use on my hair and body (unscented soap is somewhat hard to find, btw), and washed my clothes for the next day in unscented laundry detergent.

I also dug out one of our old pairs of ski goggles from the attic, and a pair of swim goggles. I have a long standing habit of rubbing my eyes at night (in fact, that’s usually how I wake up in the morning… rubbing my eyes), and I figured that the ski goggles would help me not do that. The swim goggles would help with keeping water out of my eyes while I showered.

I used my prescribed eye drops on schedule.

Day 0 (surgery day!)

I used my prescribed eye drops on schedule.

My surgery was in the late afternoon. My husband drove me there, but he had to wait in the vehicle rather than inside because of the pandemic. He said he just took a nap.

I went inside, and they fitted me with a mask (the one I brought was cloth and not sufficient apparently), and a surgical cap to keep my hair out of the way.

Then they put in several different kinds of drops, and gave me a tablet of valium to dissolve on my tongue. This is the first time I’d ever used valium. Things get kind of hazy in my memory after the valium, because apparently it works pretty well….

After leaving me to sit for some period of time in a dim room (I really have no idea how long), they led me into the room with the laser. They made me comfortable on the bed, and then did the procedure. The main thing I remember was the horrible smell of burning hair/skin (actually, burning cornea… my clothes reeked afterwards). I also remember chatting enthusiastically with/(babbling enthusiastically at) the surgeon about VOCs and lasers, and the fact that I took laser physics during my undergrad degree. I really don’t remember what, if anything, he said in return.

I also really don’t remember much about the end of the procedure. I do remember the assistant giving me a little kit bag with dark sunglasses, night shields, and drops, etc, and leading me out to the vehicle. I barely remember anything about the drive back… I think I slept. And once I got home I went to bed for a nap.

Valium. 10/10 recommend.

My husband made burritos for dinner, and after dinner I went to bed again. The night shields the surgeon’s office gave me had to be taped on, and the tape gets stuck in your hair… so I went with the ski goggles, and they have turned out to be an excellent idea. They have completely prevented me from rubbing my eyes, and they are relatively comfortable to sleep in.

My vision was super blurry on day 0, but I wasn’t in any real discomfort. I’ve noticed in many other people’s recovery stories they talked about having great vision right after the procedure. That didn’t happen for me at all.

Day 1 of recovery

I had to wake up several times in the night to put methylcellulose drops in my eyes. I woke up in the morning still under the influence from the valium (it has a nearly two day half life, so I was still partying well into day 1).

I used my prescription eyedrops on schedule.

My vision was super blurry, with significant halos and haze. I couldn’t see anything near or far because it was like looking through a sheet of waxed paper. But I wasn’t in any real discomfort.

All my post-surgery follow-ups are with my local optometrist. He was out that day, so I saw his partner for my one day checkup. He popped up the eye chart and asked me what line I could read. I could barely tell there was an eye chart there let alone read any of the lines. He scrolled up to even bigger letters. Still nothing. Even bigger letters…
“I think that’s an E?”.

Needless to say, my vision was not 20/20 on day 1.

I napped a couple of times during that first day, and also listened to movies I’d already seen before, so I knew already what was going on on the screen.

Day 2

I had significant discomfort the night before, and woke up many times to put in rewetting drops. Around 4am I finally took one of the tylenol 3’s the surgeon had prescribed. I managed to sleep for several more hours.

I woke up in the morning with the valium worn off, and extremely blurry vision (still like looking through waxed paper). Most of the PRK recovery stories I had read online talked about how their vision got better day by day, with several people saying that they were driving themselves by day 2 or 3. The thought of me driving at this point was laughable. My vision was so bad, it was actually kind of distressing. The only reason I was able to marginally function at all was because I was in the familiar environment of my house where I knew where everything was.

I used my prescription eye drops on schedule.

Virtually my entire work life involves my computer, but I couldn’t even read email at this point, even with my screen zoomed in 400%. I had warned my colleagues I would be out of commission probably “several days to maybe a week”, so luckily no one important was trying to reach me anyway.

Leading up to the surgery, I had decided a good activity to occupy myself might be to clean out closets and drawers because that didn’t require any kind of visual acuity, so that is how I amused myself for much of day 2, along with trying to take a couple of naps.

Day 3

The night before I still had to wake up several times to put in rewetting drops, but I was in significantly less discomfort. Except for that second night, discomfort has been pretty minimal.

I continued taking my prescription eye drops on schedule.

Vision was still extremely blurry, both near and far, and the difficulties seeing my laptop were aggravating… my entire work life is centred on my laptop, and the inability to read anything on it was very frustrating. Zooming in and inverting the colours helped a little bit. I also had significant haloing and starburst effects when seeing the sun shine off of car windows or metal.

My husband and I went shopping, and I had to get him to read the fine print on things like expiration dates. Other than that, I could see packaging well enough to pick out familiar products.

I worked more on cleaning out closets, and tried taking a nap.

Day 4

As usual I had to wake up several times in the night to put in rewetting drops. But again, not much discomfort.

I continued taking my prescription eye drops on schedule (my last day for the drops). Edit: I found out, two weeks later, that I was supposed to have kept on using the Durezol eye drops for a month… my prescription from the pharmacy was mis-labelled, and it took two weeks for my optometrist and the surgery centre to catch the mistake.

In the morning, I noticed that while my distance vision was still extremely blurry, I was beginning to be able to read the print in books and magazines. But not the computer… there is something about the light of the computer that made print on it look hazy.

I continued Marie Kondo-ing our closets.
There literally was nothing left at this point in the closets that didn’t spark joy.

Day 5

Today was the day to get my bandage contacts out!

My distance vision was still super blurry, and while I could read magazine print, it was still blurry and hazy. My optometrist did an eye exam, mapping my current eye contours. He then took my bandage contacts out (which turned out to be a straightforward and painless procedure). He took a careful look at my corneas, and said that they were healing very well, and looked like they were at the three-week mark instead of the five-day mark.

I asked him why my vision was still so super blurry, even though I had read all these PRK recovery blogs where people were talking about driving within a day or a few after their surgery. He said that those people almost certainly did not have an extreme vision correction, and that because of my bad astigmatism a significant amount had to be shaved off my corneas. The blurry vision I was experiencing was completely normal for the type of correction I’d had, and I’d likely continue to have blurry vision for a few more weeks before it really started to clear up.

So, I guess this is one reason I’m writing about my experience; be aware that everyone has their own timeline for recovery, and that the timelines you read for some people’s recovery may be overly optimistic for the average. If you are considering PRK, book at least a week off of work, and don’t plan on driving for at least one to two weeks.

We did the eye chart, and I was having problems seeing anything on the distance chart because of the blurriness, but interestingly my distance eyesight was somewhat better using both eyes rather than just the eye corrected for distance vision (my right eye). Somehow my brain was incorporating the crappy distance info coming from my left eye to help my right eye out, which is kind of cool (with both eyes my vision was 20/60). With the near distance eye chart I was able to read the smallest lines of text (with frequent blinking to clear the haze).

I asked my optometrist why my near vision was recovering faster than my distance vision, and he said the astigmatism was worse in my right eye, and because more had to be shaved off of it, it will take a little longer to recover. He told me that I was recovering very well, and he was pleased with my progress so far.

Before I left, my optometrist said that the surgery centre had said in their notes that I was an excellent patient. I laughed and said that he probably said that to everyone, and he said no no… they had made a note about me chatting with the surgeon about the laser.
Lol… that would be me, riding high on valium, babbling about VOCs and lasers.

Without the bandage contacts in, my eyes felt more scratchy than they had before, so on the way home from the optometrist my husband and I stopped at the pharmacy to buy more methylcellulose drops (I should have bought stock in the company that makes them before my surgery).
Once back in the car, I cracked a vial and put the drops in, and suddenly, just for a second, my distance vision was sharp and clear. It was startling, but even though the haze and blur returned almost immediately, it at least was a sign that things will get better.

That evening at home I was able to read email, and my laptop was getting noticeably easier to see (although I was still zoomed way in, had to invert the colours, make my cursor much larger, etc).

Day 6

I began this post about my recovery the morning of day 6. But now, after writing for over an hour, my vision has become extremely blurry and my eyes are scratchy. I have clearly over-done it, so I’ll stop here for the moment, and carry on later.

(Later)

My near vision was in and out all day, sometimes very crisp, but then suddenly going to very blurry in a way that only taking a bit of a break and closing my eyes for a bit would fix. My distance vision was still terrible. My eyes also felt rougher throughout the day, which apparently is to be expected when the contact bandages come out. It apparently takes several more days for the new epithelium layer to finish healing and smooth over. But it probably doesn’t help that I overdid it on the laptop, which involves a lot less frequent blinking than is likely healthy at the moment.

Day 7

I woke up several times during the night, perhaps out of habit by this point. But I didn’t feel the need to actually put in drops because my eyes didn’t feel dry enough to warrant bumbling around in the dark looking for the little tubes.

Upon waking up in the morning, I noticed I could see the clock across the room more clearly than I had since the surgery. I peeked outside, and the trees in our yard are definitely looking like they have distinct leaves. Trees further away were still an amorphous green blur though.

Near vision clarity through the day was in and out, but when it was clear it was very clear. Distance vision is improving over what it was, but still nowhere near good enough to drive safely. As the day wore on my distance vision became increasingly blurry. I went for a long walk with my husband, and even though I was wearing a hat and sunglasses, the bright glare from the sun really started to get to me. It was a relief to get back in the house into a room with the blinds down.

Day 12

Today was the day for another follow up with my optometrist. My corneal healing is still apparently progressing very well. Over the past several days my distance vision has slowly improved somewhat, although not nearly enough to drive safely. The vision chart says I am seeing 20/30 today when using both eyes (worse when just using the distance-only eye). However, my far distance visual acuity literally changes blink by blink, so 20/30 was achieved by my “best blink” effort when staring at the chart. My average distance visual acuity is much worse.

My near distance vision is good enough that I can now work on my laptop for several hours at a time (zoomed in and with the colours inverted). However, I get eye strain easily and a few times a day my near vision will abruptly go very blurry and stay that way unless I just take a break and rest my eyes for a while. There is literally nothing I can do to get them to focus when that happens, and even wearing +1 readers does not help. My optometrist says that after significant correction, the brain has to get used to the new vision (especially monovision where one eye is responsible for all the near vision), and at some point there is just overload if you over-do it.

I am going through at least a couple of dozen methylcellulose mini-droppers each day to manage the scratchy feeling in my eyes. I still wear my old ski goggles to bed to keep myself from my bad habit of rubbing my eyes (it works great for that), but I’m no longer waking up in the middle of the night to put drops in.

Day 17

I just got a phone call from the centre that did my surgery, because they were looking at the notes sent to them from my optometrist that said “patient still using rewetting drops as needed”. They asked if I was also still taking the steroid drops (Durezol). I said, “No, the instructions on the bottle said to stop four days after the surgery”. She said, “but there are post-surgery instructions in the kit we gave you that told you that you needed to take them for a month after surgery”.

I don’t recall any note like that (remember, I was flying high on valium at the time), but digging around in the kit sure enough revealed a note at the bottom with those instructions.

Anyway, I am annoyed that the instructions on the bottle that came from the pharmacy weren’t correct, and I’m annoyed that the instructions from the centre were not given to me at a time when I wasn’t completely fried on valium.

So now I am back on the Durezol drops (which feel like being stabbed in the eyeball) 3x a day for three days, 2x a day for two weeks, and 1x a day for another two weeks. Hopefully this mixup hasn’t screwed up my eye recovery.

In other recovery-related news, I finally feel confident enough to drive, but my far vision clarity is still worse than my near vision clarity. My near vision clarity can be remarkably crisp. Better than what I had before, in fact.
I did some sewing over the weekend, and while threading needles I did miss the loss of my ultra-microscopic close vision that I used to have (because I was so severely myopic), but a pair of +1 readers easily compensated for the loss.

Day 26

Today I had a check-up with my optometrist. My far vision is now 20/20 but it is still hazy. While it is good enough for driving, the haze/blur still makes driving not-enjoyable. In fact, the Durezol drops appear to have made my far distance vision even worse than it had been just before I started them again. But apparently this is a common side effect of that medication, and nothing to worry about.

In contrast, the past couple of days my near distance vision has been super crisp early in the day. So much better than it had been before with glasses in fact, that sometimes I find myself pausing in wonder to just take in the detail of what I can see. But by late afternoon things tend to degrade as my eyes get tired.

My optometrist was very apologetic about the Durezol mix-up (although it was not his fault). Apparently the surgery centre had sent him the after-care instructions for lasik surgery, not PRK (with lasik you stop the Durezol four days after). Which he thought was odd. He said I shouldn’t worry about the two weeks I wasn’t taking the drops… my recovery will just be a bit more extended.

He did a full eye exam. My near distance eye when corrected for far distance is very good; 20/10. The best I can do with my far distance eye (even with the correction machine the optometrist uses) is still a somewhat hazy 20/20 (the machine didn’t correct much). Again, my optometrist re-iterated that the right eye (the far distance eye) had a very significant correction, and there is still inflammation from that, and it just needs time to heal.

In other news, I am still wearing the ski goggles when I sleep, because the one night I didn’t wear them I woke up rubbing my eyes (which is a big no-no). Also, old habits seem to die hard… I still find myself feeling around for my glasses in the morning, and every time I go to put in eye drops I make a motion to take off my glasses.

Two months post-surgery

I had my two-month postop appointment today, and the haze in my right eye (the one corrected for distance) has almost completely cleared up. After several more weeks of continuing to take them, I am no longer on the stabby steroid eyedrops (Durezol). My distance vision has slowly improved over the past month, and is now at least 20/15 (i.e., better than 20/20). However, because I had monovision surgery, my depth perception is lacking, and I’m not a big fan of driving because turning left onto a busy street can be somewhat nerve wracking. In that sense, my distance vision is not as “whole” as it was before, even though the absolute best I ever achieved before with glasses was a somewhat blurry 20/20.

Next month I have my three-month postop appointment, and if my distance vision is stable from what it was today, I will get a prescription for driving glasses to correct my left eye (the near distance eye) such that my depth perception will be fully restored. Today the optometrist used his little clicky-clicky rotating eyeglasses machine to show me what that would look like, and my distance vision will easily be at least 20/10 once I get the glasses. It was unbelievable how good my distance vision was with that eye corrected as well.

I continue to be extremely happy with my near distance vision.

And I no longer wear the ski goggles to bed.

Reduction in all-cause deaths during the COVID-19 shutdown

Sherry Towers — Tue, 07 Apr 2020 14:17:00 +0000

Since 1962, the CDC has monitored death certificates on a weekly basis from 122 cities in the US. The CDC tallies the number of death certificates, and the number of deaths due to pneumonia, and the number of deaths due to influenza. Here I examine these data during the US COVID-19 shutdown in March, 2020. Aggregated across the country, all-cause deaths are significantly down. However, there is significant geographical variation, with some states (like New York) showing extreme excess in mortality.

Up until 2016, the CDC made death certificate tallies from 122 cities available off of its Morbidity and Mortality Weekly Reports (MMWR) website. It is also available here.
Since 2016, these data are available off of the CDC weekly FluView website, which monitors influenza and influenza like illness on a weekly basis. The data are available at the state level from here.

The website monitors the fraction of all deaths due to pneumonia and influenza via the above plot. The black lines are the expected normal seasonality, and are known as the Serfling curve. Excursions above this indicate unusual activity. You can see that the most recent two weeks of data show an increase in the percentage of deaths due to pneumonia.

Clicking on “View Chart Data” below that plot yields a comma delimited file with the year, the week of that year, the percentage of all deaths due to pneumonia or influenza, the total number of deaths, the total number of deaths due to pneumonia, and the total number of deaths due to influenza. Note that the most recent week of data is always incomplete (due to a time lag in collecting death certificates) so should always be discarded from the file when looking at total number of deaths.

I downloaded these data up to April 9, 2020 (week 13), and created the following plot of the total number of deaths by week up to the end of March:

The total number of all-cause deaths is *way* down! Over the month of March, the deficit is around 15,000 deaths compared to previous years, meaning around 15,000 people are still alive right now who otherwise would likely have died last month.

This is almost certainly due to the shutdown, where many people in the US have been staying at home. Traffic accidents (the top cause of accidental death in the US) have gone down, and the social distancing measures are affecting the transmission of most infectious diseases, which also likely explains why pneumonia deaths went down during much of that period but have started to creep up as COVID-19 deaths become more prevalent. There is evidence from the Serfling plot above that pneumonia deaths are likely going to shoot up in the coming weeks, but during March the pandemic was just getting going, so most of the pneumonia deaths then were due not to COVID-19 but to other infectious diseases. The social distancing measures interfered with the transmission of those infectious diseases in a quite dramatic way.

There have already been over 10,000 confirmed deaths due to COVID-19 in the US, and the daily number of deaths is still growing exponentially. The deficit in all-cause deaths compared to previous years unfortunately is thus not likely to last because COVID-19 is expected to rapidly become the leading weekly cause of death in the US within the next few days.

In fact, areas of the country where COVID-19 was circulating early on have distinctly different patterns in mortality compared to areas where it was not significantly circulating in March. For example, here are the data for New York City:

Even the influenza deaths are up. This is likely not due to a spike in influenza cases in NYC, but more likely due to an attempt to conserve scarce COVID-19 tests by first testing for whatever other respiratory viral illnesses might have been the cause of death (influenza tests are readily available). Thus, testing for influenza is likely happening far more often than normal.

Many (if not most) other states show a pattern where both all-cause and pneumonia deaths are down because COVID-19 was not significantly circulating in those states in March. For example, Texas:

Most states shut down before COVID-19 was significantly circulating, and thus their shutdown so far has done a good job of preventing the pandemic from progressing in their area, and also has done a good job of suppressing the regular viral crud that tends to circulate this time of year. But this causes a problem if those states return to “business as usual” in the near term, because there will be a reckoning in the all-cause mortality counts; many of the people who would have died the past month (but didn’t) have serious co-morbidities that make them much more susceptible to dying should they catch an infectious disease. Those people still have those co-morbidities. And, if social distancing measures are lifted and COVID-19 and other diseases start significantly circulating, those people are likely to die, along with the regular cohort of people who would also normally die that time of year. Thus, once social distancing measures are lifted, we will very likely see a spike in all-cause mortality.

Estimating doubling time of the COVID-19 pandemic in Canada, Germany and the US

Sherry Towers — Sat, 21 Mar 2020 19:16:42 +0000

In this post I estimate the doubling time of the COVID-19 pandemic in the US, Canada and Germany using daily death count data. At the current rate, COVID-19 deaths will exceed 2009 H1N1 deaths by early April in the US, and deaths due to COVID-19 are accelerating faster in the US than the rate of coronavirus deaths in Italy at the beginning of the pandemic there. Germany and Canada show lower rates of spread than the US, but do not yet show evidence of “flattening of the curve”.

The WorldOMeter website has been keeping track of confirmed COVID-19 daily case and death counts by country: https://www.worldometers.info/coronavirus/country/us/

Time trends in the number of deaths are likely a much more reliable way to track the spread of COVID-19 than trends in case counts, because case counts are heavily affected by changes in test availability.

United States

The cumulative number of COVID-19 deaths in America at the time I first began this post (March 21st) looks like this:

I downloaded the current data (I regularly update this post), and fit an exponential to the number of new deaths per day to estimate the pandemic doubling time:

During the 2009 H1N1 pandemic, there were 12,469 deaths in the US. On March 20th, when I originally wrote this post, my model estimate was that the number of COVID-19 deaths would exceed the 2009 flu deaths on April 7th. Sadly, that estimate has turned out to be exact. The COVID-19 pandemic is now worse than that of 2009, and will unfortunately become much worse than it already is.
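As a minimal sketch of how an exponential can be fit to daily death counts to estimate a doubling time: the data below are synthetic (simulated from a known doubling time, not the WorldOMeter counts), and the use of a Poisson regression with a log link is an assumption made for illustration rather than a description of the fit used for the plots in this post.

```r
# Synthetic daily death counts (illustrative only, NOT real data),
# simulated from a known doubling time so the fit can be checked.
set.seed(1)
true_doubling_time <- 3                       # days; assumed for the simulation
day      <- 0:20
expected <- 5 * 2^(day / true_doubling_time)  # expected deaths per day
deaths   <- rpois(length(day), expected)

# Poisson regression with a log link fits deaths ~ exp(intercept + rate * day)
fit <- glm(deaths ~ day, family = poisson(link = "log"))

# The exponential growth rate is the coefficient of day;
# the doubling time follows from exp(rate * t_double) = 2
growth_rate   <- unname(coef(fit)["day"])
doubling_time <- log(2) / growth_rate
print(doubling_time)
```

Because the counts are simulated, the estimate should land close to the assumed three days; with real data the early, low-count days carry little weight in a Poisson fit.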

The doubling rate of deaths in the US is accelerating faster than the doubling rate of deaths in Italy near the beginning of the pandemic there. Here is a plot of the number of deaths per day after the day at which each country achieved at least 10 deaths. You can see the US trend line is diverging from the Italian trend line:

Canada

Fitting an exponential to the daily number of deaths in Canada yields:

The pandemic appears to be spreading slower in Canada than in the US. There were 740 deaths in Canada due to pandemic influenza during 2009. At the current doubling time, deaths due to COVID-19 will exceed the deaths due to 2009 pandemic influenza by April 21st. Canada’s rate of rise in deaths is currently significantly less than that of Italy at the same point in their pandemic:

Germany

Fitting an exponential to the daily number of deaths in Germany yields:

The rate of rise in deaths in Germany is currently less than the rate of Italy at the same point in their pandemic:

Visual analytics application for rise/set azimuth of celestial bodies by latitude

Sherry Towers — Wed, 15 Jan 2020 19:22:56 +0000

I have created an online analytics application that calculates the rise/set azimuths of the Sun and Moon at various times in their calendrical cycle (which for the Sun takes one year, and for the Moon takes on average 18.6 years), along with the rise/set azimuths of all bright stars. The application allows the user to select the latitude of the site of interest, and also the date it was built (from 5000 BCE to 2000 CE). It also allows the user to correct for the elevation angle of the horizon… particularly in hilly terrain, this can change the rise/set azimuth of celestial bodies significantly. The application uses the Python pyephem package to calculate the rise/set azimuths, and the azimuths are corrected for refraction.

The application can be found at https://archaeoastronomy.shinyapps.io/rise_set_full_range/

A version of the application restricted to latitudes in the Northern Hemisphere, which can be a bit easier to use for fine latitude adjustment for locations in that hemisphere, can be found at https://archaeoastronomy.shinyapps.io/rise_set_restricted_range/

Note that “sunrise” and “moonrise” are calculated for the top of the disk hitting the horizon. The Sun and the Moon each subtend approximately half a degree of arc, so to obtain the rise azimuth angle for the entire disk to just be visible, add 0.5 degrees to the elevation angle of the horizon in the application.
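To illustrate why the elevation angle of the horizon matters, here is a rough geometric sketch. It is not what the application does (the application uses pyephem and corrects for refraction); it only applies the standard spherical-trigonometry relation between latitude, declination, and horizon elevation, ignoring refraction and the size of the disk. The function name rise_azimuth_deg is hypothetical.

```r
# Geometric rise azimuth (degrees east of north) of a body with declination
# dec_deg, seen from latitude lat_deg, over a horizon elevated by horizon_deg.
# Ignores refraction and disk size; scalar inputs only.
rise_azimuth_deg <- function(lat_deg, dec_deg, horizon_deg = 0) {
  rad <- pi / 180
  lat <- lat_deg * rad
  dec <- dec_deg * rad
  h   <- horizon_deg * rad
  cos_az <- (sin(dec) - sin(lat) * sin(h)) / (cos(lat) * cos(h))
  if (abs(cos_az) > 1) return(NA_real_)  # body never crosses this altitude
  acos(cos_az) / rad
}

# Sun near the June solstice (declination about 23.44 degrees) at latitude 40 N
flat  <- rise_azimuth_deg(40, 23.44, horizon_deg = 0)
hilly <- rise_azimuth_deg(40, 23.44, horizon_deg = 2)
print(c(flat = flat, hilly = hilly))
```

In this sketch a two-degree horizon shifts the rise azimuth by roughly two degrees at latitude 40 N, which is why the correction is significant in hilly terrain.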

Fitting to two or more data sets simultaneously with the graphical Monte Carlo method

Sherry Towers — Wed, 03 Apr 2019 16:21:21 +0000

In this module, we will discuss how to apply the graphical Monte Carlo method for fitting the parameters of dynamical models to data when fitting to two or more data sets simultaneously.

Sometimes, when fitting the parameters of a dynamical model, the fit involves optimising the model prediction to two or more data sets simultaneously. An excellent example of this is fitting a Lotka-Volterra predator/prey model to the time series of the predators and the prey (for example, wolves and elk). Often, the prey can vastly outnumber the predators, thus the average values in the two data sets can be at completely different scales.

When using the graphical Monte Carlo method with the fmin+1/2 method for estimating model best-fit parameters and confidence intervals, one must minimise a negative log-likelihood statistic to determine the best-fit parameters. Depending on the nature of the data and the probability distribution that underlies the stochasticity in the data, such goodness-of-fit statistics might include the Negative Binomial, Poisson, Binomial, or Normal negative log-likelihood statistics. Recall that the Normal negative log-likelihood statistic can be derived from the Least Squares (LS) statistic by the transformation:

Where min(LS) is the best-fit value of the Least Squares statistic, and N is the number of data points to which the model is being fitted.
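As a sketch of that transformation, assuming Normally distributed residuals with a common variance estimated from the best fit as min(LS)/N (an assumption chosen to be consistent with the definitions of min(LS) and N given here):

```latex
-\ln L = \frac{LS}{2\hat{\sigma}^2} + \text{const},
\qquad
\hat{\sigma}^2 = \frac{\min(LS)}{N}
\quad\Longrightarrow\quad
-\ln L \simeq \frac{N}{2}\,\frac{LS}{\min(LS)}
```

Under this scaling the statistic equals N/2 at the best fit, and a change of 1/2 in the statistic corresponds to roughly one standard deviation in a parameter.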

When fitting to multiple data sets at once, one simply adds the negative log-likelihood statistics for the separate samples into a grand negative log-likelihood statistic, and finds the parameters that minimise the grand statistic.
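A minimal sketch of the grand statistic for one parameter hypothesis, using the Poisson negative log-likelihood. The observed counts and model predictions are made-up numbers for illustration, and the function name poisson_negloglike is hypothetical.

```r
# Poisson negative log-likelihood of observed counts given model predictions
poisson_negloglike <- function(observed, predicted) {
  -sum(dpois(observed, lambda = predicted, log = TRUE))
}

# Hypothetical observations and model predictions for two data sets
# on very different scales (illustrative numbers only)
prey_obs  <- c(950, 1010, 1100, 1040)
prey_pred <- c(1000, 1020, 1060, 1050)
pred_obs  <- c(12, 9, 11, 15)
pred_pred <- c(10, 10, 12, 14)

# The grand statistic is simply the sum over the data sets
grand_negloglike <- poisson_negloglike(prey_obs, prey_pred) +
  poisson_negloglike(pred_obs, pred_pred)
print(grand_negloglike)
```

In a graphical Monte Carlo fit this sum would be recomputed for every sampled parameter hypothesis, with both predictions coming from the same model run.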

Note that when using a Negative Binomial negative log-likelihood statistic, the fit must also include the over-dispersion parameter for that probability distribution, alpha. However, when fitting to multiple data sets that likely have very different average scales, the alphas must be different in the likelihood calculations for the separate data sets that are then combined into the grand likelihood. Thus, you will need to fit for an extra parameter for every data set you fit to.
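A sketch of a grand Negative Binomial statistic with a separate over-dispersion parameter per data set. It assumes the parameterisation in which the variance is mu + alpha*mu^2, which maps onto R's dnbinom via size = 1/alpha; the function names and argument names are hypothetical.

```r
# Negative Binomial negative log-likelihood, assuming variance = mu + alpha*mu^2
# (so that R's "size" argument is 1/alpha)
nb_negloglike <- function(observed, predicted, alpha) {
  -sum(dnbinom(observed, size = 1 / alpha, mu = predicted, log = TRUE))
}

# Grand statistic for two data sets, each with its own alpha
grand_nb_negloglike <- function(obs1, pred1, alpha1, obs2, pred2, alpha2) {
  nb_negloglike(obs1, pred1, alpha1) + nb_negloglike(obs2, pred2, alpha2)
}

# Illustrative call with made-up counts, predictions, and alphas
grand <- grand_nb_negloglike(
  obs1 = c(950, 1010, 1100), pred1 = c(1000, 1020, 1060), alpha1 = 0.01,
  obs2 = c(12, 9, 11),       pred2 = c(10, 10, 12),       alpha2 = 0.2
)
print(grand)
```

In the fit, alpha1 and alpha2 would be sampled along with the model parameters, which is the extra parameter per data set mentioned above.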

For the Least Squares statistic, the number of points used in the calculation of the separate Normal negative log-likelihood statistics is the number of points in each particular sample. And the minimum LS used in the calculation is the minimum value for each particular sample. One then adds the separate Normal negative log-likelihood statistics to obtain the grand statistic.
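A sketch of that bookkeeping, assuming the Normal negative log-likelihood is obtained from the Least Squares statistic as (N/2)*LS/min(LS), i.e. with the variance estimated as min(LS)/N (this form is an assumption here). The LS values are made-up numbers, and ls_to_normal_negloglike is a hypothetical name.

```r
# Convert Least Squares values (one per sampled parameter hypothesis, all for
# the SAME data set) to Normal negative log-likelihoods.
# Assumed form: (N/2) * LS / min(LS)
ls_to_normal_negloglike <- function(ls_values, n_points) {
  0.5 * n_points * ls_values / min(ls_values)
}

# Made-up LS values for five parameter hypotheses, each evaluated against
# both data sets (30 points per data set)
ls_prey     <- c(5200, 4100, 3900, 4500, 6000)
ls_predator <- c(14.0, 9.5, 10.2, 8.8, 12.1)

# Each sample uses its own N and its own min(LS); the results are then added
grand <- ls_to_normal_negloglike(ls_prey, 30) +
  ls_to_normal_negloglike(ls_predator, 30)

best <- which.min(grand)
print(best)
```

Note that the hypothesis minimising the grand statistic need not minimise either data set's LS on its own, which is the case in this toy example.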

The grand negative log-likelihood statistic for whichever goodness-of-fit statistic is used is then plotted versus the model hypotheses, and the best-fit parameter values are the values that minimise this statistic. The one standard deviation uncertainties can be determined using the fmin+1/2 method.

Testing if one model fits the data significantly better than another model

Sherry Towers — Mon, 18 Mar 2019 13:17:11 +0000

When doing Least Squares or likelihood fits to data, sometimes we would like to compare two models with competing hypotheses. In this module, we will discuss the statistical methods that can be used to determine if one model is statistically significantly favoured over another.

In this past analysis, my colleagues and I examined mass killings data in the US, and fit the data with a model that included a contagion effect, and another model that included temporal trends but no contagion effect. Because the data were count data (number of mass killings per day), we used a Negative Binomial likelihood fit.

And in this past analysis, which was an AML 612 class publication project, we examined data from a norovirus gastrointestinal disease outbreak aboard a cruise ship, and fit the data with a model that included both direct transmission, and transmission from contaminated surfaces in the environment. We also fit a model with just direct transmission, and a model with just environmental transmission. Again, because the data were count data (number of cases per day), we used a Negative Binomial likelihood fit.

In this past analysis, which was also an AML 612 class publication project, we examined US Twitter data and Google search trend data related to sentiments expressing concern about Ebola during the virtually non-existent outbreak of Ebola in the US in 2014. We fit the data with a mathematical model of contagion that used the number of media reports per day as the “vector” that sowed panic in the population, and included a “recovery/boredom” effect where after a period of time, no matter how many news stories run about Ebola, people lose interest in the topic. We compared this to a simple regression fit that did not include a boredom effect.

What helped to make these analyses interesting and impactful was the exploration of what dynamics in the model had to be there in order to fit the data well. When you have the skill set to fit a model to data with the appropriate likelihood statistic and estimate model parameters and uncertainties, it opens up a wide range of interesting research questions you can explore… you fit the model to the data as it is today, and then using that model you can explore the effect of control strategies. However, if you also add in the ability to make statements about which dynamical systems fit the data “better” than alternate modelling hypotheses, you’ll find there is a lot of very interesting low-hanging research fruit out there.

For the purposes of the following discussion, we will assume that we are fitting to our data using a negative log-likelihood statistic (recall that the Least Squares statistic can be transformed to the Normal negative log-likelihood statistic).

Likelihood ratio test for nested models

When two models are “nested” meaning that one has all the dynamics of another (ie: all the dynamics of a simpler “null model”, plus one or more additional effects), we can use what is known as the likelihood ratio test to determine if the more complex model fits the data significantly better. To do this, first calculate the best-fit negative log-likelihood of the null model (the simpler model):

x_0=-log(L_0)

Then calculate the best-fit negative log-likelihood of the more complex model:

x_1=-log(L_1)

The likelihood ratio test is based on the statistic lambda = -2*(x_1-x_0). If the null model is correct, this test statistic is distributed as chi-squared with degrees of freedom, k, equal to the difference in the number of parameters between the two models that are being fit for. This is known as Wilks’ theorem.

If the null hypothesis is true, this means that the p-value

p=1-pchisq(lambda,k)    Eqn(1)

should be drawn from a uniform distribution between 0 and 1.

If the p-value is very small, it means the null hypothesis model is dis-favoured, and we accept the alternate hypothesis that the more complex model is favoured. Generally, a p-value cutoff of p<0.05 is used to reject a null hypothesis.

Note that more complex models generally fit the data better than a simpler nested model (ie: the minimum value of the negative log-likelihood statistic will be smaller). This is because the more complex model has more parameters, and those extra parameters give the model more wiggle room to fit to variations in the data. But the problem is that more parameters also increases the uncertainty on all the parameter estimates; so, yes… it might give a lower negative log-likelihood, but there is a cost to be paid for that in the number of parameters you had to add in. The more parameters you add in, the more you are at risk of “over-fitting” to statistical fluctuations in the data, rather than trends due to true underlying dynamics.

Wilks’ likelihood ratio test in effect penalizes you for the number of extra parameters you are fitting for (that k in Eqn 1 above). The higher k is, the lower the best-fit negative log-likelihood for the more complex model has to be in order for the null model to be rejected.

Example

One example of the kind of research question that can be answered using this methodology: when fitting to data for a predator-prey system (say, coyotes and rabbits), we can examine whether or not a Holling type II model fits the data significantly better than a Holling type I model. In the Holling type I model, the number of prey consumed, Y, is a linear function of the density of prey, X, the discovery rate, a, and the time spent searching, T:

Y = a*X*T               Eqn(2)

In a Holling type II model, the relationship is

Y = a*X*T/(1+a*b*X)     Eqn(3)

Note that the Holling type I model is nested within the Holling type II model when b=0, and thus a likelihood ratio test can be used to determine if one model fits the data significantly better. The Holling type II model has one extra parameter being fitted for compared to the Holling type I model.
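The nesting can be made concrete with a short sketch of Eqns 2 and 3. The function names are hypothetical, and the argument search_time stands in for T (which is avoided as a name here because T is shorthand for TRUE in R); the parameter values are made up for illustration.

```r
# Holling type I: prey consumed is linear in prey density X (Eqn 2)
holling_type1 <- function(X, a, search_time) {
  a * X * search_time
}

# Holling type II: consumption saturates as prey density grows (Eqn 3)
holling_type2 <- function(X, a, b, search_time) {
  a * X * search_time / (1 + a * b * X)
}

X <- c(10, 50, 200)   # illustrative prey densities
type1      <- holling_type1(X, a = 0.02, search_time = 5)
type2_b0   <- holling_type2(X, a = 0.02, b = 0,   search_time = 5)
type2_b0.5 <- holling_type2(X, a = 0.02, b = 0.5, search_time = 5)

print(rbind(type1, type2_b0, type2_b0.5))
```

Setting b=0 reproduces the type I values exactly, while any b>0 pulls consumption below the type I line, more strongly at high prey density.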

For example, if our “null” Holling type I model when fit to the data with b=0 yields a best-fit negative log-likelihood of x_0=-log(L_0)=900, and our “alternate” Holling type II model yields a best-fit negative log-likelihood of x_1=-log(L_1) = 898.5, we would calculate the likelihood ratio test statistic as lambda = -2*(x_1-x_0)=2*1.5=3. If the null hypothesis is true, then lambda should be distributed as chi-squared with degrees of freedom, k, equal to the difference in the number of parameters between the two models (in this case, k=1, because the only difference in the parameters between the two models is the addition of the parameter b). Thus, the test is

pvalue_testing_null=1-pchisq(lambda,difference_in_degrees_of_freedom) 
pvalue_testing_null=1-pchisq(3,1)

which yields pvalue_testing_null=0.083. Thus in this case, we would say there is no statistically significant evidence that the Holling type II model fits the data better than the Holling type I model. This doesn’t mean, btw, that the Holling type II model is “wrong” and the Holling type I model is “right”. It simply means that based on these particular data, there is no statistically significant difference. Having more data increases your sensitivity to detecting differences in the dynamics.

If, for example, you fit the two models to another, larger, data set, and find x_0=532 and x_1=521, then in this case

pvalue_testing_null=1-pchisq(-2*(521-532),1)

and we get pvalue_testing_null=2.7e-6, which is very small indeed, and we conclude that the Holling type II model fits the data significantly better. In a paper, we would state these results along these lines: “The Negative Binomial negative log-likelihood best-fit statistics for the Holling type I and Holling type II models were 532 and 521, respectively. The likelihood ratio test statistic has p-value<0.001, thus we conclude the dynamics of the Holling type II model appear to be favoured over those of the type I model.”
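The two likelihood ratio tests above can be wrapped in a small helper. This is a minimal sketch; the function name likelihood_ratio_test is hypothetical, and it calls pchisq with lower.tail=FALSE, which is equivalent to 1-pchisq(lambda,k) but keeps more precision when the p-value is very small.

```r
# Likelihood ratio test for nested models, given the best-fit negative
# log-likelihoods of the null and alternate models and the difference, k,
# in the number of fitted parameters
likelihood_ratio_test <- function(negloglike_null, negloglike_alt, k) {
  lambda  <- -2 * (negloglike_alt - negloglike_null)
  p_value <- pchisq(lambda, df = k, lower.tail = FALSE)
  list(lambda = lambda, p_value = p_value)
}

# The two hypothetical Holling type I vs type II comparisons from the text
small_data <- likelihood_ratio_test(900, 898.5, k = 1)
large_data <- likelihood_ratio_test(532, 521, k = 1)

print(small_data$p_value)
print(large_data$p_value)
```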

What to do if the models being compared aren’t nested

In all of the examples above, we assumed that the models were nested: for example, in the contagion in mass killings analysis, the model without contagion but just with temporal trends was nested within the model with contagion and temporal trends. In the norovirus analysis, the models with just direct transmission, and just environmental transmission, were nested in the model that contained both effects.

But what if we are trying to compare two plausible models that aren’t nested? In that case, we can use the Akaike Information Criterion (AIC) statistic, which is twice the best-fit negative log-likelihood, plus two times the number of parameters being fit for, q:

AIC = 2*(best-fit neg log likelihood) + 2*q

Note that the AIC statistic penalizes you for the number of parameters being fit for… increasing the number of parameters might help bring the negative log-likelihood down, but it can result in an increase of the AIC.

As an overly general statement, models with low AIC are usually preferred over models with high AIC. In fact, in your Stats 101 journeys, you may have heard that you should simply pick the model with the lowest AIC as being the “best” model. However, it’s not quite that simple. Sometimes, one model might give a slightly lower AIC than another, but that does not mean that it is definitively “better”….

Quantitatively using AIC to compare models

We can use what is known as the “relative likelihood” of the AIC statistics to quantitatively compare the performance of two models being fit to the same data to determine if one appears to “significantly” fit the data better.

To do this, we first must calculate the AIC of the two models. If the negative log-likelihood of model #1 is x_1=-log(L_1), and the negative log-likelihood of model #2 is x_2=-log(L_2), then the AIC statistic for model number 1 is

AIC_1 = 2*x_1 + 2*q_1

where q_1 is the total number of parameters being fitted for in model #1, and the AIC statistic for model #2 is

AIC_2 = 2*x_2 + 2*q_2

where q_2 is the total number of parameters being fitted for in model #2.

Now, find the minimum value of the two AIC statistics,

AIC_min = min(AIC_1,AIC_2)

Then, for each model, calculate the quantity:

B_i=exp((AIC_min-AIC_i)/2)

And then for each model calculate what is known as the “relative likelihood” (see also this paper):

p_i = B_i/sum(B_i)

If one of the models has a relative likelihood p_i>0.95, we conclude it is significantly favoured over the other.
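The steps above can be sketched as a short helper. The function name relative_likelihood is hypothetical, and the negative log-likelihoods and parameter counts in the usage lines are made-up values for illustration.

```r
# Relative likelihoods of competing models from their best-fit negative
# log-likelihoods and numbers of fitted parameters
relative_likelihood <- function(negloglike, n_params) {
  aic <- 2 * negloglike + 2 * n_params   # AIC_i = 2*x_i + 2*q_i
  b   <- exp((min(aic) - aic) / 2)       # B_i = exp((AIC_min - AIC_i)/2)
  list(aic = aic, p = b / sum(b))        # p_i = B_i / sum(B_i)
}

# Two hypothetical models: x_1 = 100 with 2 parameters, x_2 = 97 with 3
result <- relative_likelihood(negloglike = c(100, 97), n_params = c(2, 3))
print(result$aic)
print(result$p)
```

Because the function is vectorised, the same call works for more than two candidate models fit to the same data.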

As mentioned above, sometimes people just choose the model that has the lowest AIC statistic as being the “best” model (and in fact this is very commonly seen in the literature), but problems arise when there is only a small difference between the AIC statistics being compared. If they are very close, one model really is not much better than the other. Calculation of the relative likelihood statistic makes that apparent.

You can use the AIC statistic to compare nested models, but if the models truly are nested, then Wilks’ likelihood ratio test is preferred.

AIC Example #1

As an example of the use of AIC statistics to compare models, let’s examine our Holling type I/II hypothetical analysis (even though it is a nested model example, and Wilks’ likelihood ratio test would be the preferred method to compare the models).

If our Holling type I model when fit to the data with parameter b=0 yields a best-fit negative log-likelihood of x_1=-log(L_1)=900 and one parameter is being fit for (the “a” in Equation 2) then the AIC statistic for that model is

AIC_1 = 2*900 + 2*1=1802

If the fit of our Holling type II model yields a best-fit negative log-likelihood of x_2=-log(L_2) = 898.5, when two parameters are being fit for (the “a” and “b” in Equation 3), then the AIC statistic for that model is

AIC_2 = 2*898.5 + 2*2=1801

Thus, the AIC of the Holling type II model looks to be just a little bit lower than that of the Holling type I model. The B statistics for the two models are

B_1 = exp((1801-1802)/2) = 0.606 
B_2 = exp(0) = 1

And the relative likelihoods are

p_1 = B_1/(B_1+B_2) = 0.606/1.606 = 0.377 
p_2 = B_2/(B_1+B_2) = 1/1.606 = 0.623

Neither of these p_i’s is greater than 0.95, so we conclude that neither model is significantly favoured over the other.

In a paper, we would state these results along the following lines: “The AIC statistics derived from the Holling type I and Holling type II model fits to the data are 1802 and 1801, respectively. Neither model appears to be strongly preferred [1]”.

With reference [1] being:

Wagenmakers EJ, Farrell S. AIC model selection using Akaike weights. Psychonomic bulletin & review. 2004 Feb 1;11(1):192-6.

However, as noted above, because these are actually nested models, a better choice would be Wilks’ test. In practice, you should only use AIC for non-nested models.

AIC Example #2

If our Holling type I model when fit to another set of data with parameter b=0 yields a best-fit negative log-likelihood of x_1=-log(L_1)=532 and one parameter is being fit for (the “a” in Equation 2) then the AIC statistic for that model is

AIC_1 = 2*532 + 2*1=1066

If the fit of our Holling type II model yields a best-fit negative log-likelihood of x_2=-log(L_2) = 521, when two parameters are being fit for (the “a” and “b” in Equation 3), then the AIC statistic for that model is

AIC_2 = 2*521 + 2*2=1046

The AIC of the Holling type II model is lower than that of the Holling type I model. The minimum value of the AIC for the two fits is 1046.

The B statistics for the two models are thus

B_1 = exp((1046-1066)/2) = 4.5e-5 
B_2 = exp(0) = 1

And the relative likelihoods are

p_1 = B_1/(B_1+B_2) = 4.5e-5/1.000045 = 4.5e-5 
p_2 = B_2/(B_1+B_2) = 1/1.000045 = 0.99996

The relative likelihood of the second model is greater than 0.95, thus we conclude it is significantly favoured.

In a paper we would say “The AIC statistics derived from the Holling type I and Holling type II model fits to the data are 1066 and 1046, respectively. We conclude the Holling type II model is significantly preferred [1].” With reference [1] again being:

Wagenmakers EJ, Farrell S. AIC model selection using Akaike weights. Psychonomic bulletin & review. 2004 Feb 1;11(1):192-6.

AIC Example # 3

The R script example_AIC_comparison.R generates some simulated data according to the model A*exp(B*x)+C with Poisson distributed stochasticity. It then uses the graphical Monte Carlo method to fit a linear model (intercept+slope*x) and the model A*exp(B*x)+C to the simulated data using the Poisson negative log-likelihood as the goodness of fit statistic.

After 50,000 iterations of the Monte Carlo sampling procedure, the script produces the plot:

Based on the best-fit negative log likelihoods and the number of parameters being fit for in each model (two for the linear model, and three for the exponential model), the script also calculates the AIC statistic, and the relative likelihoods of the models derived from those statistics.

The script outputs the following:

We see that the exponential model (which was actually the true model underlying the simulated data) is significantly favoured over the linear model.

Something to try is instead of using integer values of x from 0 to 150 in the script, use integer values from 0 to 50. You will find with this smaller dataset that there is no statistically significant difference between the two models… there is simply not enough data to make the determination.

Graphical Monte Carlo method: choosing ranges over which to sample parameters

Sherry Towers — Tue, 26 Feb 2019 19:33:47 +0000

In this module we will discuss how to choose ranges over which to sample parameters using the graphical Monte Carlo method for fitting the parameters of a mathematical model to data. We will also discuss the importance of using the Normal negative log-likelihood statistic (equivalent to Least Squares) when doing Least Squares fitting, rather than the Least Squares statistic itself.

In this past module, we discussed the graphical Monte Carlo method for fitting model parameters to data. In a past module, we also described how to estimate the one standard deviation uncertainties in model parameters using the “fmin+1/2” method, where fmin is the minimum value of the negative log likelihood.

Also in that module, we discussed that the Least Squares statistic (LS) is related to the Normal distribution negative log likelihood via

where min(LS) is the minimum value of the Least Squares statistic that you obtained from your many sampling iterations of the graphical Monte Carlo method.

When doing Least Squares fits with the graphical Monte Carlo method, to facilitate choosing the correct ranges used to Uniformly randomly sample the parameters and to estimate the parameter uncertainties, you should plot the Normal negative log-likelihood versus the model parameter hypotheses, rather than the Least Squares statistic.

Similarly, the negative log likelihood of the Pearson chi-squared weighted least squares statistic is

and if using that statistic, you should use this negative log-likelihood in your fits in order to assess the parameter uncertainty.
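As a sketch of that relation, assuming the Pearson chi-squared statistic weights each squared residual by the model prediction, and treating the data as approximately Normal with variance equal to the prediction (the symbols y_i for the data and m_i for the model prediction are introduced here for illustration):

```latex
\chi^2_P = \sum_{i=1}^{N} \frac{(y_i - m_i)^2}{m_i},
\qquad
-\ln L \simeq \frac{1}{2}\,\chi^2_P
```

Under this approximation, a change of 1/2 in the negative log-likelihood corresponds to a change of 1 in the Pearson chi-squared statistic.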

Choosing parameter ranges

Once you have your model, your data, and a negative log-likelihood statistic appropriate to the probability distribution underlying the stochasticity in the data, you need to write a program in R or Matlab (or whichever programming language you prefer) that randomly samples parameter hypotheses, compares the data to the model prediction, and calculates the negative log-likelihood statistic. The program needs to do this sampling procedure many, many times, storing the parameter hypotheses and negative log-likelihoods in vectors.
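A minimal sketch of such a program in R, using a deliberately simple model (a straight line with Poisson-distributed counts) rather than a dynamical model. The data are simulated, and the sampling ranges and variable names are assumptions made for illustration.

```r
set.seed(42)

# Simulated data (illustrative only): Poisson counts around a straight line
x <- 0:50
y <- rpois(length(x), lambda = 10 + 0.5 * x)

n_iter <- 5000

# Uniformly randomly sample parameter hypotheses over broad initial ranges
intercept_hyp <- runif(n_iter, 0, 30)
slope_hyp     <- runif(n_iter, 0, 2)
negloglike    <- numeric(n_iter)

for (i in seq_len(n_iter)) {
  # Model prediction for this parameter hypothesis
  predicted <- intercept_hyp[i] + slope_hyp[i] * x
  # Poisson negative log-likelihood of the data given the prediction
  negloglike[i] <- -sum(dpois(y, lambda = predicted, log = TRUE))
}

# Best-fit hypothesis among those sampled
best <- which.min(negloglike)
print(c(intercept = intercept_hyp[best], slope = slope_hyp[best]))

# The "graphical" part: plot the statistic versus each parameter hypothesis
# plot(slope_hyp, negloglike)
```

For a dynamical model, the line computing predicted would be replaced by a numerical solution of the model equations for the sampled parameters.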

For the initial range to sample the parameters, choose a fairly broad range that you are pretty sure, based on your knowledge of the biology and/or epidemiology associated with the model, includes the true value. Do enough iterations that you can assess more or less where the minimum is (I usually do 1,000 to 10,000 iterations in initial exploratory runs if I’m fitting for one or two parameters… you will need more iterations if you are fitting for more than two parameters at once). Then plot the results without any kind of limit on the range of the y-axis. Check to see if there appears to be an obvious minimum somewhere within the range sampled (and that the best-fit value does not appear to be at one side or the other of the sampled range). Using a broad range can also give you an indication if there are local and global minima in your likelihoods.

If the best-fit value of one of the parameters appears to be right at the edge of the sampling range, increase the sampling range and redo this first pass of fitting. Keep on doing that until it appears that the best-fit value is well within the sampled range.

Once you have the results from these initial broad ranges, when plotting the results restrict the range on the y axis by only plotting points with likelihood within 1000 of the minimum value of the likelihood. From these plots record the range in the parameter values of the points that satisfy this condition.
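Reading the new ranges off the zoomed plots can also be cross-checked numerically. The helper below is a hypothetical sketch (the function name and the made-up parabolic likelihood surface are mine), returning, for each sampled parameter, the range spanned by the points within a chosen distance of the minimum.

```r
# Hypothetical helper: for each sampled parameter (the columns of the data
# frame samples), find the range spanned by the points whose negative
# log-likelihood is within delta of the minimum value found
new_sampling_ranges = function(samples, vnll, delta) {
  keep = vnll <= min(vnll) + delta
  sapply(samples[keep, , drop = FALSE], range)
}

# toy illustration with a made-up parabolic likelihood surface
set.seed(1)
samples = data.frame(f = runif(10000, 0, 1), t0 = runif(10000, -50, 0))
vnll = 5000 * (samples$f - 0.7)^2 + 0.5 * (samples$t0 + 5)^2

# first row of each result is the lower edge, second row the upper edge
print(new_sampling_ranges(samples, vnll, delta = 1000))
print(new_sampling_ranges(samples, vnll, delta = 100))
print(new_sampling_ranges(samples, vnll, delta = 15))
```

I would still look at the plots before trusting these numbers, because a best-fit value sitting at the edge of the sampled range is much easier to spot by eye.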

Now do another run of 1000 to 10,000 iterations (more if fitting for more than two parameters), sampling parameters within those new ranges (again, re-adjusting the range and iterating the process if necessary if the best-fit value of a parameter is at the left or right edge of the new sampled range).

Then plot the results, this time only plotting points with likelihood within 100 of the minimum value of the likelihood. Record the new range of parameters to sample, and run again.

Then plot the results of that new run, plotting points with likelihood within 15 of the minimum value of the likelihood and determine the new range of parameters to sample.

Then do a run with many more Monte Carlo iterations such that the final plot of the likelihood versus the sampled parameters is well populated (aim for hundreds of thousands to several million iterations… this can be achieved using high performance computing resources). Having many iterations allows precise determination of the best-fit values and one standard deviation uncertainties on your parameters (and 95% confidence interval).

Example: fitting to simulated deer wasting disease data

Chronic wasting disease (CWD) is a prion disease (akin to mad cow disease) that affects deer and elk species, first identified in US deer populations in the late 1960s. The deer never recover from it, and eventually die. There has yet to be a documented case of CWD passing to humans who eat venison, but lab studies have shown that it can be passed to monkeys.

CWD is believed to be readily transmitted between deer through direct transmission, and through the environment. There is also a high rate of vertical transmission (transmission from mother to offspring). CWD has been rapidly spreading across North America.

In this exercise, we’ll assume that the dynamics of CWD spread are approximately described by an SI model, with births and deaths that occur with rate mu:

where N=S+I, and beta is the transmission rate.

Setting the time derivatives equal to zero and solving for S yields that the possible equilibrium values of S* are S*=N (the disease free equilibrium) and S*=N*mu/beta (the endemic equilibrium). Note that we can express beta as beta=N*mu/S* = mu/(1-f), where f=I*/N is the endemic equilibrium prevalence of infection.
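For reference, here is a sketch of an SI system that matches this description. It assumes frequency-dependent transmission and that all births enter the susceptible class; I chose those assumptions because they reproduce the two equilibria quoted above.

```latex
\begin{align}
\frac{dS}{dt} &= \mu N - \beta \frac{S I}{N} - \mu S, \\
\frac{dI}{dt} &= \beta \frac{S I}{N} - \mu I .
\end{align}
Setting $dI/dt = 0$ gives $I^*\left(\beta S^*/N - \mu\right) = 0$, so either
$I^* = 0$ (and hence $S^* = N$) or $S^* = N\mu/\beta$. In the endemic case,
with $f = I^*/N = 1 - S^*/N$,
\begin{equation}
\beta = \frac{N\mu}{S^*} = \frac{\mu}{1 - f} .
\end{equation}
```

Under these assumptions N stays constant, because adding the two equations gives dN/dt = 0.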

Let’s assume that officials randomly sample 100 deer each year out of a total population of 100,000, and test them for CWD (assume that the test doesn’t kill the deer) to estimate the prevalence of the disease in the deer population. The file cwd_deer_simulated.csv contains simulated prevalence data from such sampling studies carried out over many years.

The R script fit_si_deer_cwd_first_pass.R does a first pass fitting to this data, for the value of f=I*/N and the time of introduction, t0. The time of introduction clearly can’t be after the data begin to be recorded, so the script samples it from -50 years to year zero. We also don’t know what f is so we randomly uniformly sample it from 0 to 1. The script does 1000 iterations, sampling the parameters, calculating the model, then calculating the Binomial negative log-likelihood at each iteration, and produces the following plot:

f=I*/N has a clear minimum, but the time of introduction is perhaps not so obvious. Let’s try zooming in, only plotting points with likelihood within 1000 of the minimum value:

Here we can clearly see that the best-fit value of t0 is somewhere between -50 and zero (recall that it can’t be greater than zero because that is where the data begin). The best-fit value of I*/N appears to be around 0.70, but the parabola enveloping the likelihood in the vicinity of that minimum appears to be highly asymmetric. When choosing our new sampling ranges, we want to err on the side of caution, and sample past 0.73-ish to make sure we really have the minimum value of the likelihood within the new range.

In the file fit_si_deer_cwd_second_pass.R we do another 10,000 iterations of the Monte Carlo procedure, but this time, based on the above plot, sampling t0 from -50 to zero, and I*/N from 0.5 to 0.75. The script produces the following plot where it zooms in on points for which the likelihood is within 100 of the minimum value of the likelihood:

Based on these plots, it looks like in our next run, we should sample the time of introduction from -20 to zero, and sample I*/N from around 0.62 to 0.75. Going up to 0.75 may again be overly cautious in over-shooting the range to the right, but we’ll do another run, then adjust it again if we need to. The file fit_si_deer_cwd_third_pass.R does 10,000 more iterations of the Monte Carlo sampling, using these ranges, and produces the following plot only showing points that have likelihood within 15 of the minimum value:

Now, in a final run, we can sample t0 from -8 to zero, and I*/N from 0.67 to 0.72. The R script fit_si_deer_cwd_fourth_pass.R does this, and produces the following plot:

From this output we can estimate the best-fit, one standard deviation uncertainty (range of parameter values within fmin+1/2), and 95% confidence intervals (range of points within fmin+0.5*1.96^2).
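These ranges can also be extracted numerically from the stored vectors. The function below is a minimal sketch (its name, and the toy Binomial example of 70 positives out of 100 sampled deer, are assumptions for illustration and not part of the fit_si_deer_cwd scripts).

```r
# Sketch: best-fit value, one standard deviation range, and 95% confidence
# interval of a parameter from graphical Monte Carlo output
mc_estimates = function(vpar, vnll) {
  fmin = min(vnll)
  list(best   = vpar[which.min(vnll)],
       one_sd = range(vpar[vnll <= fmin + 0.5]),
       ci95   = range(vpar[vnll <= fmin + 0.5 * 1.96^2]))
}

# toy check: Binomial negative log-likelihood for 70 positives out of 100
# sampled deer, as a function of the hypothesised prevalence f
set.seed(2)
vf = runif(200000, 0.5, 0.9)
vnll = -dbinom(70, size = 100, prob = vf, log = TRUE)

est = mc_estimates(vf, vnll)
print(est)
```

The estimates are only as precise as the plot is well populated, which is why the final run needs so many more iterations than the exploratory ones.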

Reducing the size of file output

Once you have narrowed in on the best parameter ranges with the above iterative method, there are still many parameter hypothesis combinations that fall within those ranges that yield large likelihoods nowhere near the minimum value (especially if you are fitting for several parameters at once). When running R scripts in batch to get well-populated likelihood-vs-parameter plots to ensure precise estimates of the best-fit parameters and uncertainties, this can lead to really large file output if you output every single set of sampled parameter hypotheses and likelihood.

To control the size of file output, what I’ll often do is only store parameter hypotheses and likelihoods for which the likelihood was within, say, 100 of the minimum likelihood found so far by that script.
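A minimal sketch of that bookkeeping is below. The one-parameter model and the parabolic likelihood surface are made up, and for simplicity the kept values are held in memory rather than written to a file.

```r
# Sketch: only keep hypotheses whose negative log-likelihood is within
# max_delta of the smallest value found so far
set.seed(3)
max_delta = 100
niter = 20000
kept_par = numeric(0)
kept_nll = numeric(0)
min_so_far = Inf

for (iter in 1:niter) {
  p = runif(1, 0, 1)              # hypothetical one-parameter hypothesis
  nll = 5000 * (p - 0.7)^2        # made-up likelihood surface
  min_so_far = min(min_so_far, nll)
  if (nll <= min_so_far + max_delta) {
    kept_par = c(kept_par, p)
    kept_nll = c(kept_nll, nll)
  }
}

# early entries were kept relative to a poorer minimum, so prune once more
keep = kept_nll <= min_so_far + max_delta
kept_par = kept_par[keep]
kept_nll = kept_nll[keep]
cat("kept", length(kept_par), "of", niter, "sampled hypotheses\n")
```

When several batch jobs are combined afterwards, the same pruning step should be repeated using the overall minimum across all of the jobs.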

Graphical Monte Carlo parameter optimisation: Uniform random sampling

Sherry Towers — Mon, 25 Feb 2019 18:33:02 +0000

In this module, we will discuss the graphical Monte Carlo parameter optimisation procedure using Uniform random sampling of the parameter hypotheses, and compare and contrast this method with the graphical Latin hypercube method.

Once you have chosen an appropriate goodness-of-fit statistic comparing your model to data, you need to find the model parameters that optimise (minimise) the goodness-of-fit statistic. The graphical Monte Carlo Uniform random sampling method is a computationally intensive “brute force” parameter optimisation method that has the advantage that it doesn’t require any information about the gradient of the goodness-of-fit statistic, and is also easily parallelizable to make use of high performance computing resources.

In this module, we discussed a related method, graphical optimisation with Latin hypercube sampling. With Latin hypercube sampling, you assume that the true parameters must lie somewhere within a hypercube you define in the k-dimensional parameter space. You grid up the hypercube with M grid points in each dimension. At each of the M^k grid points, you calculate the goodness-of-fit (GoF) statistic. With this method you can readily determine the approximate location of the minimum of the GoF. Because nested loops are involved, with use of computational tools like OpenMP, the calculations can be readily parallelized to some degree. However, if the sampling region of the hypercube is too large, you need to make M large to get a sufficiently granular estimate of the dependence of the GoF statistic on the parameters. Several iterations of the procedure are usually necessary to narrow down the appropriate sampling region.
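To make the gridding concrete, here is a minimal sketch of such a sweep for k=2 parameters. The quadratic goodness-of-fit surface, the parameter names, and the ranges are made up for illustration.

```r
# Sketch of a grid sweep over a k = 2 parameter hypercube with M grid points
# per dimension (the goodness-of-fit surface is made up for illustration)
gof = function(a, b) (a - 4)^2 + 100 * (b - 0.2)^2

M = 21
grid = expand.grid(a = seq(0, 10, length.out = M),
                   b = seq(0, 1, length.out = M))
grid$gof = gof(grid$a, grid$b)   # evaluated at all M^k grid points

best = grid[which.min(grid$gof), ]
print(best)
```

The best-fit values can only be resolved to the grid spacing (here 0.5 in a and 0.05 in b), which is the granularity problem mentioned above.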

If you make the sampling range for each of the parameters large enough that you are sure the true parameters must lie within those ranges, with this method you can be reasonably sure that you are finding the global minimum of the GoF, rather than just a local minimum. This is important, because in real epidemic or population data there are very often local minima in GoF statistics.

Because it is difficult to parallelise Latin hypercube sampling, and there is discrete granularity in the sampling, a preferable method is randomly Uniformly sampling the parameters over a range. The method is easy to implement, and easy to parallelise. With this method, parameter hypotheses are randomly sampled from Uniform distributions in ranges that would be expected to include the best-fit values. The goodness-of-fit statistic is calculated for a particular set of hypotheses and stored, along with the hypotheses, in vectors. The procedure is repeated many, many times, and then the goodness-of-fit statistic is plotted versus the parameter hypotheses to determine which values appear to optimise the GoF statistic.

Example: fitting to simulated pandemic influenza data

We’ll use as example “data” a simulated pandemic flu-like illness spreading in a population, with true model:

with seasonal transmission rate

The R script sir_harm_sde.R uses the above model with parameters 1/gamma=3 days, beta_0=0.5, epsilon=0.3, and phi=0. One infected person is assumed to be introduced to an entirely susceptible population of 10 million on day 60, and only 1/5000 cases is assumed to be counted (this is actually the approximate number of flu cases that are typically detected by surveillance networks in the US). The script uses Stochastic Differential Equations to simulate the outbreak with population stochasticity, then additionally simulates the fact that the surveillance system on average only catches 1/5000 cases. The script outputs the simulated data to the file sir_harmonic_sde_simulated_outbreak.csv

The R script sir_harm_uniform_sampling.R uses the parameter sweep method to find the best-fit model to the simulated pandemic data in sir_harmonic_sde_simulated_outbreak.csv. Recall that our simulated data and the true model looked like this:

The sir_harm_uniform_sampling.R script randomly Uniformly samples the parameters R0, t0, and epsilon over many iterations and calculates three different goodness-of-fit statistics (Least squares, Pearson chi-squared, and Poisson negative log-likelihood). Question: this is count data… which of those GoF statistics is probably the most appropriate to use?
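For concreteness, here is a sketch of how the three statistics can be computed in R for a vector of observed counts y and a vector of model predictions mu. The function names and the example numbers are made up, and this is not necessarily how sir_harm_uniform_sampling.R codes them.

```r
# Sketch of the three goodness-of-fit statistics for observed counts y and
# model predictions mu (mu must be positive)
least_squares = function(y, mu) sum((y - mu)^2)
pearson_chisq = function(y, mu) sum((y - mu)^2 / mu)
# Poisson negative log-likelihood, including the log(y!) term
poisson_negloglike = function(y, mu) -sum(dpois(y, lambda = mu, log = TRUE))

# made-up counts and model predictions for illustration
y  = c(3, 7, 12, 20, 31)
mu = c(4, 8, 13, 19, 28)

cat("Least squares:                  ", least_squares(y, mu), "\n")
cat("Pearson chi-squared:            ", pearson_chisq(y, mu), "\n")
cat("Poisson negative log-likelihood:", poisson_negloglike(y, mu), "\n")
```

Note that the three statistics weight the residuals differently, so they do not have to agree on which parameter hypothesis is best.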

Here is the result of running the script for 10,000 iterations, sweeping R0, the time of introduction t0, and epsilon parameters of our harmonic SIR model over certain ranges (how did I know which ranges to use? I ran the script a couple of times and fine-tuned the parameter ranges to ensure that the ranges actually contained the values that minimized the GoF statistics, and that the ranges weren’t so broad that I would have to run the script for ages to determine a fair approximation of the best-fit values). Question: based on what you see in the plots below do you think 10,000 iterations is enough? Also, do all three GoF statistics give the same estimates for the best-fit values of the three parameters? Which fit would you trust the most for this kind of data?

This method is known as the “graphical Monte Carlo method”. “Graphical” because you plot the goodness-of-fit statistic versus the parameter hypotheses to determine where the minimum is, and “Monte Carlo” because parameter hypotheses are randomly sampled.

The uniform random sampling method is easily parallelizable to make use of high performance computing resources, like the ASU Agave cluster, or NSF XSEDE resources.

Here is a summary of the results of running the sir_harm_uniform_sampling.R script in parallel using the ASU Agave high performance computing resources (note that the density of points in the plot below is what you are aiming to achieve when using the graphical Monte Carlo procedure… the plots above are too sparse to reliably estimate the best-fit parameters and their uncertainties!). Note that the Pearson chi-squared, least squares, and Poisson negative log-likelihood statistics all predict somewhat different best-fit parameters. We were fitting to count data… which of these goodness-of-fit statistics is likely the most appropriate?

In addition to being easily parallelisable, the graphical Monte Carlo method with Uniform random sampling also has the advantage that Bayesian priors on the model parameters are trivially applied post-hoc to the results of the parameter sampling.
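As a sketch of why this is trivial: because the hypotheses were sampled Uniformly, a prior can be applied by adding the negative log of the prior density to each stored negative log-likelihood. The Normal prior on R0, the sampling range, and the parabolic likelihood surface below are all assumptions made up for illustration.

```r
# Sketch: apply a prior post hoc to graphical Monte Carlo output by adding the
# negative log of the prior density to each stored negative log-likelihood
set.seed(4)
vR0  = runif(50000, 1, 3)          # Uniformly sampled hypotheses for R0
vnll = 40 * (vR0 - 1.8)^2          # made-up negative log-likelihood

# assumed prior for illustration: R0 ~ Normal(mean = 1.5, sd = 0.1)
vnll_post = vnll - dnorm(vR0, mean = 1.5, sd = 0.1, log = TRUE)

best_like = vR0[which.min(vnll)]
best_post = vR0[which.min(vnll_post)]
cat("best-fit R0 without prior:", best_like, "\n")
cat("best-fit R0 with prior:   ", best_post, "\n")
```

No re-running of the model is needed, so different priors can be tried on the same stored output.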

Protected: AML 612 Spring 2019: project prospectus list and scoring rubrics

Sherry Towers — Sun, 24 Feb 2019 17:00:30 +0000

Protected: Running R in batch with ASU high performance computing resources. Note: for latest info on how to run R in batch, contact Gil Speyer (speyer@asu.edu)

Sherry Towers — Wed, 20 Feb 2019 02:18:55 +0000

Contagion models with non-exponentially distributed sojourn times in the infectious state

Sherry Towers — Tue, 19 Feb 2019 17:17:58 +0000

Compartmental models of infectious disease transmission inherently assume that the time spent (“sojourn time”) in the infectious state is Exponentially distributed. As we will discuss in this module, this is a highly unrealistic assumption. We will show that the “linear chain trick” can be used to incorporate more realistic probability distributions for state sojourn times into compartmental mathematical models.

A simple example of a compartmental model of infectious disease spread is the Susceptible, Infectious, Recovered model. In several past modules we have discussed this model in detail, but briefly, individuals in the susceptible compartment can be infected on contact with infectious people in the population (whereupon they flow to the “infectious” compartment). Infectious people recover with some rate, gamma, and flow into the “recovered and immune” compartment.

The compartmental diagram for the model looks like this:

and the system of ordinary differential equations describing these dynamics is:

Inherent in these model equations is the assumption that the sojourn time in the infectious compartment is Exponentially distributed, with rate gamma. You can see this by following a group of I_0 people who are all infectious at time t=0, and ignoring any new infections, in which case the equation for dI/dt is dI/dt=-gamma*I. The solution to this equation is I(t) = I_0 exp(-gamma*t), which goes to zero as t goes to infinity: the fraction of the group that is still infectious at time t, exp(-gamma*t), falls off Exponentially.

The probability distribution for the sojourn time in the infectious state for this model thus looks like this:

Notice that the most probable time for leaving the infectious state is time t=0. This implies that the most probable time that you will recover from a disease like influenza or measles is immediately after being infected… this is clearly highly unrealistic! For all diseases, realistically, the probability distribution for the sojourn time in the infectious state looks more like a bump, like this:

With this distribution, at time t=0, the probability of leaving the infectious state is zero (as it is in reality). A few people recover early on after being infected, but most of those infected recover near the middle of the bump. A few take much longer to recover, and are in the tails of the distribution.

So, how can we incorporate realistic sojourn times like this in compartmental models? And why would we even want to? (hint: think about control strategies, like treatment or isolation, that might be aimed at people at various times after they are first infected… there is often a delay between time of infection and the application of an intervention strategy).

Gamma distributed sojourn times

It turns out the Gamma distribution offers an easy way to incorporate realistic sojourn times in a model. The Gamma distribution has two parameters, a shape parameter, k, and a scale parameter theta. The mean of the distribution is mu=k*theta. The probability density function for the Gamma distribution is:

When k is an integer, the distribution is called the Erlang distribution, and for the special case when k=1, the distribution is the Exponential probability distribution. It turns out that an Erlang distributed random number with shape k and scale parameter theta is the sum of k Exponentially distributed random numbers, each with mean theta (that is, with rate 1/theta). Here is an example of how the parameter k affects the shape of the Erlang distribution when scale parameter theta=1/k (and thus the mean of the distribution is one):

Notice that the higher the value of k, the more narrow and peaked the distribution is.
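The relationship between the Erlang and Exponential distributions is easy to check by simulation. Here is a minimal sketch using R's built-in rexp() and qgamma() functions; the choice of k=4 and the number of simulated values are arbitrary.

```r
# Sketch: the sum of k Exponential random numbers, each with mean theta,
# follows a Gamma (Erlang) distribution with shape k and scale theta
set.seed(5)
k = 4
theta = 1 / k          # scale chosen so that the mean, k*theta, is one
nsim = 100000

# each row holds k Exponential draws with rate 1/theta; sum across each row
x = rowSums(matrix(rexp(nsim * k, rate = 1 / theta), ncol = k))

cat("simulated mean:    ", mean(x), " expected:", k * theta, "\n")
cat("simulated variance:", var(x), " expected:", k * theta^2, "\n")

# compare simulated quantiles with those of the Gamma distribution
probs = c(0.1, 0.5, 0.9)
print(rbind(simulated = quantile(x, probs),
            gamma     = qgamma(probs, shape = k, scale = theta)))
```

The variance k*theta^2 equals 1/k when theta=1/k, which is why the distribution narrows as k increases while the mean stays at one.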

Linear chain trick

Because an Erlang distributed random number with shape k and scale theta is the sum of k Exponentially distributed random numbers with mean theta, there is a method called the “linear chain trick” that adds k disease stages to a compartmental model, each of which flows into the next with rate k*gamma (except for the last, which flows to the recovered class), where 1/gamma is the average infectious period for the disease. Each stage then has mean sojourn time theta=1/(k*gamma), and the total time spent infectious is Erlang distributed with mean k*theta=1/gamma.

For the SIR model, if we assume the rate is gamma, we get

The R script sir_erlang_sojourn.R shows an example of how to code up a linear-chain model in R. It requires that the sfsmisc and deSolve libraries have been installed on your computer. If they have not, type in your R console

install.packages(c("sfsmisc","deSolve"))

and choose an R repository mirror site close to your location for the download. Then type in your R console

source("sir_erlang_sojourn.R")

This produces the following plot:

The higher the value of k, the narrower the peak of the outbreak. The script also prints out the final size for the various values of k: the final size of the outbreak is independent of k.

In the absence of births and deaths (vital dynamics) in the model, the reproduction number for this model is exactly the same as a model that just assumes k=1. That is to say, R0=beta/gamma.
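Here is a minimal, dependency-free sketch of a linear-chain SIR model that can be used to check the claim about the final size. It is not the sir_erlang_sojourn.R script: it uses a hand-written fourth-order Runge-Kutta step instead of deSolve so that it runs in base R, and the function names, parameter values, population size, and step size are all assumptions for illustration.

```r
# Sketch of an SIR model with k infectious stages (linear chain trick)
# state = c(S, I_1, ..., I_k, R)
sir_erlang_derivs = function(state, beta, gamma, k) {
  S = state[1]
  I = state[2:(k + 1)]                      # the k infectious stages
  N = sum(state)
  new_inf = beta * S * sum(I) / N           # new infections per unit time
  inflow = c(new_inf, k * gamma * I[-k])    # flow into each infectious stage
  c(-new_inf, inflow - k * gamma * I, k * gamma * I[k])
}

# integrate to time tmax and return the fraction of the population recovered
final_size = function(beta, gamma, k, N = 1e4, I0 = 1, dt = 0.05, tmax = 300) {
  state = c(N - I0, I0, rep(0, k - 1), 0)
  for (step in 1:round(tmax / dt)) {
    d1 = sir_erlang_derivs(state, beta, gamma, k)
    d2 = sir_erlang_derivs(state + 0.5 * dt * d1, beta, gamma, k)
    d3 = sir_erlang_derivs(state + 0.5 * dt * d2, beta, gamma, k)
    d4 = sir_erlang_derivs(state + dt * d3, beta, gamma, k)
    state = state + dt * (d1 + 2 * d2 + 2 * d3 + d4) / 6
  }
  state[k + 2] / N
}

# made-up parameters: R0 = beta/gamma = 1.5, mean infectious period 3 days
vk = c(1, 2, 5)
vfinal = sapply(vk, function(k) final_size(beta = 0.5, gamma = 1 / 3, k = k))
print(data.frame(k = vk, final_size = vfinal))
```

The printed final sizes should agree with each other to within numerical precision, even though the timing and height of the epidemic peak differ with k.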

This paper discusses the relationship between R0, k, gamma, and the rate of exponential rise at the beginning of an outbreak for SIR and SEIR models.

Example LaTeX and BibTeX documents

Sherry Towers — Sat, 02 Feb 2019 22:04:26 +0000

In this module, I provide an example LaTeX document that cites references within a BibTeX file, and also includes examples of how to include equations, figures, and tables.

The files for this worked example can be found in my GitHub repository https://github.com/smtowers/example_latex

The repository contains the main LaTeX document example_latex.tex, along with the BibTeX file example_latex.bib. In order to compile the document, you also need to download the file example_latex_histogram_plot.eps, which is the figure included in the document. To compile the document, run LaTeX once, then BibTeX, then LaTeX twice (which should resolve all references).

This should produce the file example_latex.pdf

Note that the encapsulated postscript (EPS) figure for the paper was produced with the R script example_latex.R (you need to install the R extrafont library before running the script). The R script also shows how to automatically output results from your analysis code as \newcommand definitions that you can copy and paste into your LaTeX file, so that you can reference those results in the text of your paper without having to manually transcribe numbers (which can lead to unnecessary transcription errors).
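As a rough sketch of the idea (not necessarily how example_latex.R does it), results held in a named list can be turned into \newcommand lines like this. The macro names, values, and output file name are made up for illustration.

```r
# Sketch: write analysis results to a file as LaTeX \newcommand definitions
# (macro names, values, and file name are made up; LaTeX macro names may
# only contain letters)
results = list(bestfitRzero = 1.84, numDataPoints = 120)

tex_lines = sprintf("\\newcommand{\\%s}{%s}",
                    names(results), sapply(results, format))

outfile = file.path(tempdir(), "analysis_results.tex")
writeLines(tex_lines, con = outfile)
cat(tex_lines, sep = "\n")
```

After pasting the output into the preamble of the LaTeX file, writing \bestfitRzero in the text produces the corresponding number, and re-running the analysis regenerates the definitions.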

Data and R code repositories in GitHub

Sherry Towers — Fri, 01 Feb 2019 02:32:50 +0000

GitHub is a web-based version-control and collaboration platform for software developers.

Git, an open source code management system, is used to store the source code for a project and track the complete history of all changes to that code. It allows developers to collaborate on a project more effectively by providing tools for managing possibly conflicting changes from multiple developers. GitHub allows developers to change, adapt and improve software from its public repositories for free. Repositories can have multiple collaborators and can be either public or private.

GitHub facilitates social coding by providing a web interface to the Git code repository and management tools for collaboration.

Because GitHub is intuitive to use and its version-control tools are useful for collaboration, non-programmers have also begun to use GitHub to work on document-based and multimedia projects.

Three important terms used by developers in GitHub are fork, pull request and merge. A fork is simply a repository that has been copied from one member’s account to another member’s account (a branch, by contrast, is a separate line of development within a single repository). Forks and branches allow a developer to make modifications without affecting the original code. If the developer would like to share the modifications, she can send a pull request to the owner of the original repository. If, after reviewing the modifications, the original owner would like to pull the modifications into the repository, she can accept the modifications and merge them with the original repository.

In the following, we’ll talk about GitHub at its simplest: as a repository for data files you might want to read into R, and also as a repository for R library packages you might develop. I won’t talk about the finer points of versioning here… just the basics of how to create your own GitHub repository and upload files to it via the online interface.

GitHub data repositories

My primary use of GitHub is as a repository for data files that I want to share with others, and that can be read by R Shiny visual analytics scripts that I develop (although I can also incorporate the data files as part of the R Shiny application, so it doesn’t necessarily need to be in a repository like GitHub for this purpose). I could, of course, use Dropbox to share my files, but GitHub allows me to write descriptions of them, and also makes them searchable online.

For example, on my GitHub account, I have a data repository: https://github.com/smtowers/data

In this repository, I have several files that I share publicly, including the file Geneva_1918_influenza.csv, which is the daily incidence of influenza hospitalisations in Geneva, Switzerland during the 1918 influenza pandemic. The raw file can be found here. Putting this file on my GitHub repository allows me to share it publicly with whoever might want it simply by giving them the URL. Importantly, I can also read the file directly from GitHub within an R script. To try this out yourself, within the R console, type:

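# URL of the raw CSV file in the GitHub data repository
fname = "https://raw.githubusercontent.com/smtowers/data/master/Geneva_1918_influenza.csv"
# read the comma-separated file, keeping strings as character vectors
thetable = read.table(fname,header=TRUE,as.is=TRUE,sep=",")
# plot the num column of the table against row number
plot(thetable$num)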

This also allows me to access the files in R Shiny scripts running off of servers like the shinyapps.io server, and to share the data file with whoever else might want to use it in their analysis or applications.

An R Shiny script that I have written that uses this data can be found at https://sjones.shinyapps.io/geneva/ The app reads in the data, plots it, and then overlays the predictions of an SIR disease model with seasonally forced transmission, with parameters input by the user via slider bars. In another module, I talk about how to create your own R Shiny applications (which may or may not read data from GitHub).

Creating a GitHub account

Creating a GitHub account is simple and free. Go to github.com and click on “Sign Up For GitHub”. Once you have the account, sign in. To create a new repository, click on the green “New” button at the left hand side of the page:

When the dialogue window pops up, give your repository a name and short description, and click the “Initialize this repository with a README” box:

Click “Create Repository”.

You now have a blank repository, ready to be filled with your files. To upload a file, click on the “Upload files” tab near the upper right:

It will take you to a dialogue box where you can choose the file you want to upload from your computer. Choose your file. Then a dialogue box opens asking you to fill in a description of the file:

Once you click “Commit Changes” your file will now appear in your GitHub repository.

Should you want to update the file in the future, simply repeat the process, starting with “Upload file”. If you upload a file with the same name as a file already in the main branch of the repository, it will be over-written.

Making your own R library packages in GitHub

It is remarkably easy to upload your own R code to GitHub as an R library package that others can download and install. This website gives the complete guide to doing that, and is in fact the main resource I used to learn how to do this myself.

I created an R library, for example, with some code related to an analysis my colleagues and I did quantifying the average number of infections that descend down the chain-of-infection of a person infected during an outbreak. Those include the people that person directly infects, plus the number those go on to infect, plus the number those go on to infect, and so on until the chain-of-infection eventually dies out. We called this quantity the “average number of descendant infections”, or ANDI. With ANDI, we can quantify the average probability that at least one person ends up hospitalised down the chain-of-infection from an unvaccinated person infected in an outbreak of vaccine preventable diseases like measles (turns out, that probability is almost 100% in locations where vaccine coverage is sub-standard).

Our analysis code would likely be of interest to others, so we made an R library package of the methods to make it easy for people to download and use (we called the package “ANDI”). We also mentioned the package in our paper. To install the package yourself from GitHub (or any other R library package you find on GitHub, and there are many), install the devtools package in R:

install.packages("devtools")

then type:

require("devtools")
install_github("smtowers/ANDI")
require("ANDI")

There is example code showing how to use the methods in the package in https://github.com/smtowers/ANDI/blob/master/example.R

Visual analytics with R Shiny

Sherry Towers — Fri, 25 Jan 2019 17:59:24 +0000

In this module, students will learn about the rapidly growing field of visual analytics, and learn how to create their own online visual analytics applications using the R Shiny package.

What is the field of “visual analytics”?

(see original source here)

Visual analytics (or “viz”) involves the development of interactive tools that facilitate analytical reasoning based on visual interfaces. The idea behind viz is to put data and/or models into the hands of the public, policy makers, other researchers, or other stakeholders, and allow them to visually examine the data or your models, integrate their own knowledge and perform their own selections to help them reach conclusions of importance to them. It can help to solve problems whose size and complexity, and/or need for real-time expert input, would otherwise make them intractable.

For example, say there is an animal pathogen that has the potential to be used in a bio-terrorism attack against the farming economy. You may have developed a meta-population dynamical model of disease spread that includes the spread of the pathogen among domesticated and wild animals in local areas, and also spread of the pathogen across borders (for example because the animals are transported or move between areas, or people carry the pathogen on their shoes or clothing). For public officials, who might have very few options at their disposal to stop the outbreak, you could, for example, develop a visual application that shows a map of the progression of the outbreak in the areas, with the intensity of the map colours indicating the prevalence of infection in that area at a particular time step.

You could provide tools that allow the public officials to visually examine the relative efficacy of different control strategies, based on their knowledge of what is actually feasible. Things like culling animals in and around the initially infected areas, or perhaps examining how limited vaccine stores might be employed, or limiting the transport of animals, or sanitation of the boots and clothing of farm workers or veterinarians, or stopping of travel across borders altogether. The visual analytics application allows officials to combine their expert knowledge and expertise with the model predictions via the visual interface in order to reach optimal solutions under multiple constraints, particularly constraints that might change in time.

Lots of examples of visual analytics applications produced by various companies or organisations can be found online. For example here, here, and here.

Visualisation applications can involve quite complex integrated high-level coding environments, and may involve different kinds of output to several different screens simultaneously, such as the system used by the Decision Theater at Arizona State University. Designing maximally effective visual analytics apps is based on quantitative analytics, graphic design, perceptual psychology, and cognitive science.

However, visual analytics applications need not necessarily be complex to be impactful. For example, applied mathematicians in the life and social sciences use dynamical models in analyses for quantification and prediction; relatively simple visual analytics applications can put those models and associated data (if relevant) into the hands of policy makers to allow them to examine how the model predictions change under different initial conditions or with different parameters. In addition, it means that people don’t have to rely on just the two dimensional plots you put in a publication… given the URL of an online visual analytics app associated with the analysis, they can go to that app and further examine the model and data themselves.

More and more, I try to integrate visual analytics into my own research, because I believe it has the potential to make my research much more impactful. In addition, it provides me with a way to share my data, and to make my analysis methodologies as transparent as possible.

I am also finding that development of visual analytics apps for my own use is quite useful… it is remarkably helpful, for example, to have slider bars for data selection or model inputs and examine how the analysis results or model predictions change when my assumptions change. I find this much easier compared to constantly repeating the process of editing a program and re-running it.

Some examples of visual analytics or code-and-data-sharing frameworks

A “visual analytics” application is simply any application that allows users to interact with data and models, and associated analysis methods (like fitting methods, for example). In this sense, any programming language that allows you to make interactive plots allows you to write visual analytics applications. However “online visual analytics” applications are ones that are hosted online, and do not require any specialised software for the user to run (other than a web browser).

In the following, I’ll mention several different software packages that allow the creation of interactive applications. Not all, however, provide the potential for online hosting of the applications. And this is certainly not an exhaustive list of all tools that are out there.

Code sharing with Mathematica notebooks

Students may already have some experience with applications they might have shared with others in a “notebook” format. For example, Mathematica notebooks allow users to share Mathematica code, which might include dynamical interactive selection criteria provided by user-driven slider bars or radio buttons that allow the user to examine the code output or plots under different selection criteria. Mathematica notebooks are typically shared by email or by posting the code in the cloud for others to download and then use themselves with Mathematica running on their computer. Mathematica is not free software and requires a site license. However, they have a free application called the Wolfram CDF player that allows users to examine Mathematica notebooks.

Code sharing of interactive scripts with Matlab

Matlab also allows for incorporation of user-interface controls like slider bars (for example) in Matlab scripts that can then be shared with others via email, or by posting the code in the cloud for others to download and then use themselves within Matlab running on their computer. Matlab is not free software, and requires a site license. However, GNU Octave is free software that allows users to run Matlab scripts (but Matlab users can’t necessarily run Octave scripts).

Plotly visual analytics package in Python (allows online sharing)

The Plotly (or Plot.ly) package in the Python programming language allows users to make online graphing applications, and provides free online hosting for applications. Examples of Plotly interactive and non-interactive applications can be searched for here. In my opinion, Plotly has similarities to the R Shiny package that I describe below, but I have noted that most of the example applications I have so far come across online are non-interactive for some reason.

Sharing code and data in Julia, Python and R with Jupyter

Jupyter is free, open-source software that is widely used in industry. It is an integrated data management and code development environment that interfaces to several different programming languages (including Julia, Python, and R… which is in fact where Ju-pyt-er got its name). You download it to your computer, and it displays the results of a programming script (which might involve interactive elements) that is either on your local computer, or on a website that Jupyter loads and then runs on your computer. Here is an example of an R script being run through the Jupyter interface. In 2014, Nature published an article discussing the advantages of the integrated code and data development environment provided by Jupyter.

Visual analytics with Tableau

Tableau allows for the creation of nice visual analytics apps with a simple drag-and-drop interface, and is quite popular in the business intelligence community. Unfortunately, it lacks the quantitative and computational tools we would typically use in dynamical modelling analyses (such as ODE and PDE solvers, or delay differential equation solvers). Tableau is not free, and requires a site license.

D3 JavaScript library for creation of online interactive visual analytics

D3 is a JavaScript library for producing static or interactive visualisations in web browsers, and is widely used for many data visualisation applications, including by many online news sites. Some of the applications are quite fun to play with (even if sometimes it is unclear what the point was, other than that the app looks cool). However, from an applied mathematics perspective, D3 suffers from many of the same shortcomings as Tableau, in the sense that there are no canned methods available to numerically solve the kinds of equations typically involved in the dynamical models we use; it is possible to write your own JavaScript methods to solve ODEs, PDEs, etc, but that is a significant amount of programming overhead on your part.

Visual analytics with Infogram

Infogram is a website that allows you to create online visual analytics dashboards via an intuitive drag-and-drop interface (you can also create static infographics like plots, pie charts, etc, with no user interactivity). The dashboards can then be shared with others via a URL. Signing up for an account is free, and allows you to host up to 10 dashboards on their site. There are paid options that allow you to host more. While the development interface is simple and intuitive, and the site allows the potential development of nice, relatively uncluttered looking visual analytics, there are no statistical, numerical and computational tools that allow for more complicated modelling-related visual analytics applications.

Visual analytics applications with Flourish

Flourish is another website that allows you to create simple online visualisation applications, again with a drag-and-drop interface, and again with free hosting. And again, the application can be shared via URL. The functionality appears to be even more basic than that provided by Infogram, and again has no statistical, numerical or computational tools that allow for more sophisticated applications. Some examples of visual analytics applications in Flourish can be seen here.

R Shiny library for creation of online interactive visual analytics

One advantage of examining visual analytic application examples on Flourish, and Infogram, and on other software platforms, is that you can get ideas on how to convey information in your own visual analytics applications in a clean, elegant looking format. However, as we saw above, most other visual analytics applications suffer from the drawback that you either have to pay for the software, or the software simply is not sophisticated enough for use in dynamical modelling applications, or in other more complicated statistical analysis applications.

R Shiny is free and open-source, and is a package for the R programming language, and thus is integrated with the vast powerhouse of statistical, numerical and computational methods that is R. It allows for development of visual analytics applications that can be hosted off of websites like shinyapps.io, or off of your own server if you have installed the R Shiny Server (ASU unfortunately does not yet have this hosting ability, but I’m working on seeing if it can happen). The website shinyapps.io allows you to host up to five apps at a time, and gives you 25 hours per month of interactive access to them. This is plenty for most applications, unless you create a very popular application (in which case, shinyapps.io also offers paid plans with broader access options).

Of course, you could always opt to share your R Shiny applications with other R users by sending them your app code via old-fashioned email or in the cloud, but hosting off of a site like shinyapps.io means that all you have to do is give a user the URL of your app, rather than requiring them to have R installed, download the R code from their email or the cloud, and then run the app within R.

There are many online examples of R Shiny visual analytics applications that range from fairly simple to more fancy, fancier, and quite fancy.

Here is an example of an R Shiny app that allows you to examine movie reviews in the Rotten Tomatoes database (best viewed on its own hosting website):

And here is an example of an R Shiny app I wrote that reads in daily influenza hospitalisation data for Geneva, Switzerland during the 1918 influenza pandemic, and overlays the predictions of an SIR model with seasonal transmission, with model parameters provided by the user via slider bars (the app is best viewed off of the shinyapps.io website where it is hosted):

Anatomy of an R Shiny application

To use R Shiny, you first have to install the shiny package. In the R console, type:

install.packages("shiny") 
require("shiny")

An R Shiny application is built from two building blocks, which the runApp() function in the shiny library uses to create an interactive web browser application running locally on your computer (these are also the files you will need to upload to an external R Shiny server to share your application online with others):

Code in a file called ui.R that defines the ui() function: the user interface function that sets up the page layout, and defines the user inputs via text boxes, radio buttons, slider bars, etc.
Code in a separate file called server.R that defines the server() function, which takes the inputs from ui.R and makes selections on the data, and/or creates plots of the data, and/or creates plots of your model with the input parameters, and/or tabulates things, and/or lets the user download the data, etc.
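
To make the division of labour between the two files concrete, here is a minimal sketch of a complete app. It is not the Geneva app: the title, the input name npoints, the output name scatter and the plotted curve are all made up for illustration, and for brevity both pieces are shown in one script, with comments marking what would go in ui.R and what would go in server.R.

```r
library(shiny)

# ---- what would go in ui.R: page layout and user inputs ----
ui <- fluidPage(
  titlePanel("Hypothetical slider demo"),
  sidebarLayout(
    sidebarPanel(
      # slider whose current value the server code sees as input$npoints
      sliderInput("npoints", "Number of points:", min = 10, max = 500, value = 100)
    ),
    mainPanel(
      # placeholder on the page that the server code fills in as output$scatter
      plotOutput("scatter")
    )
  )
)

# ---- what would go in server.R: turns the inputs into outputs ----
server <- function(input, output) {
  output$scatter <- renderPlot({
    # this block re-runs automatically whenever the slider is moved
    x <- seq_len(input$npoints)
    plot(x, sqrt(x), xlab = "x", ylab = "sqrt(x)")
  })
}

# Single-script equivalent of the ui.R/server.R pair
app <- shinyApp(ui = ui, server = server)
```

Typing runApp(app) in an interactive R session should open this sketch in your web browser. The id strings are what tie the two halves together: "npoints" in sliderInput() becomes input$npoints on the server side, and "scatter" in plotOutput() is filled by output$scatter.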

You can also have server.R source other R files in the working directory, where you’ve written your own functions to do various things. It can also read data from files in your working directory. As we will see below, when you deploy the app online, these files will automatically get uploaded to the server.

Probably the best way to get an initial understanding of what the ui.R and server.R files do is to look at an example application. I’ve put the R files related to the Geneva 1918 influenza app in my GitHub repository: https://github.com/smtowers/geneva

Create a working directory on your computer, and from my GitHub repository download the files ui.R, server.R, and geneva_utils.R. The geneva_utils.R file just contains a bunch of functions used by the app that I didn’t want littering my server.R file. In the server.R file, notice that I load the functions in geneva_utils.R file with the line:

source("geneva_utils.R",local=T)

Whenever a shiny app sources another R file that is part of the app package, use the local=T option.
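
For readers wondering what the local option actually changes, here is a small base R sketch. The helper file demo_utils.R and the function double_it() are hypothetical and are written to a temporary directory just for this demonstration. With local=TRUE, source() evaluates the file in the environment it was called from, rather than in the global workspace.

```r
# Write a hypothetical helper file to a temporary directory
helper_file <- file.path(tempdir(), "demo_utils.R")
writeLines("double_it <- function(x) 2 * x", helper_file)

use_helper <- function(x) {
  # local = TRUE evaluates the file inside this function's environment,
  # so double_it() is defined here rather than in the global workspace
  source(helper_file, local = TRUE)
  double_it(x)
}

print(use_helper(21))

# The helper never leaked into the global workspace
print(exists("double_it", envir = globalenv(), inherits = FALSE))
```

In a shiny app the calling environment is the app's own code, so a plausible reading of the advice is that it keeps the sourced helper functions with the app instead of relying on whatever happens to be in the global workspace.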

In R, change to that working directory, and type:

require("shiny")
runApp()

A window should have opened up in your web browser, with the app running in it. This is running locally on your computer, using the code you just downloaded. In order to make an app public to other people online, you need to upload it to an R Shiny server (which is really easy… instructions on how to do that are below).

Creating a shinyapps.io account to host your R Shiny applications

Shinyapps.io is a website hosting service that allows hosting of up to five R Shiny apps per person for free. They have paid options if you need to host more apps.

Go to shinyapps.io and set up a free account, and then log in. You will be presented with these welcome pages:

Follow the instructions in steps 1 and 2 (you need to click the button “Show Secret” before you copy the R code to paste into R). You only need to do steps 1 and 2 once per computer you will be using R from. You might want to save that code snippet for later reference if you will be running R from multiple computers.

Step 3 will be coming up when you deploy your first app….

Deploying the 1918 Geneva influenza application to your own shinyapps.io account

Make sure that R is in the working directory to which you downloaded the ui.R, server.R and geneva_utils.R files. Now, in R, type

require("rsconnect") 
deployApp(account="

The deployApp() function will automatically build any library packages your app depends on, and then upload all your files to the server. If the server times out, just repeat the deployApp() command. You can also re-run deployApp() whenever you update your code, to upload the newest version of your app.

Once the app has finished deploying, it will pop up in your web browser. You can now share the URL with whomever you like. The app URL will look like:

https://.shinyapps.io/




Deleting or archiving R Shiny apps on shinyapps.io

If you would like to delete or archive old shiny apps, simply log in to your shinyapps.io account, and go to https://www.shinyapps.io/admin/#/applications/all

It will list your apps, and you will see icons giving you options to delete the app or archive it.

Including data files in your R Shiny app

Your R Shiny app can either read in web-based files in a GitHub or Dropbox repository, using commands like this:

fname = "https://raw.githubusercontent.com/smtowers/data/master/Geneva_1918_influenza.csv"
thetable = read.table(fname,header=T,as.is=T,sep=",")

or

fname = "https://www.dropbox.com/s/drn0nqnn8a85c7t/Geneva_1918_influenza.csv?dl=1"
thetable = read.table(fname,header=T,as.is=T,sep=",")

Or, you can simply make a data subdirectory off of the working directory where you are developing your shiny app, put your data file in there, and read it in your R Shiny code with a command like the following (the path here assumes the subdirectory is named data):

fname = "./data/Geneva_1918_influenza.csv"
thetable = read.table(fname,header=T,as.is=T,sep=",")

When you deploy your app to a server like shinyapps.io using deployApp(), the data file will be uploaded with the rest of the files in the package.

Troubleshooting R Shiny apps

When running your app locally, there will usually be error messages printed to your R console if the app has problems running. These will usually point you to specific lines in your files where there is a problem. A common problem is forgetting to put commas between the various layout elements in your ui.R file.

If your app runs fine locally, but once you upload it you get the error message “Error: An error has occurred. Check your logs or contact the app author for clarification”, this can often be a sign that you forgot to explicitly load the R libraries needed by your script within the script itself, using require() or library() statements. So, while at some point you might have loaded them in your R session when running some other script, the shiny app doesn’t know about them once it has been uploaded to the server.

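
As a side note on the choice between require() and library(), here is a small base R sketch of how the two behave when a package is missing. The package name is deliberately made up and is assumed not to exist on your system.

```r
# A package name that is assumed not to exist, used only for illustration
pkg <- "notARealPackage12345"

# require() only returns FALSE (with a warning) when a package is missing,
# so a script carries on and fails later, wherever the package is first used
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
print(loaded)

# library() stops immediately with an error that names the missing package
msg <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "package loaded"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

For this reason, putting library() calls at the very top of server.R may make a missing package easier to spot in the logs than require() would, although either one satisfies the need to load packages explicitly.
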
To get more information as to what is causing the app not to run, log in to your shinyapps.io account.  Then click on the Applications tab on the right to see your list of apps.

For example, if my bird_market app were having issues, I could click on that app, and the following page would appear:

I can now click on the Logs tab, and a dialogue will appear showing the informational and error messages.  Look for warning or error messages in the logs, and fix whatever issues you see in your app code.  For example, in the log file for bird_market, I saw this:

 
It was apparent that I was referencing a parameter eta without it first being defined.  After I fixed that issue, the app ran fine (you can see it at https://sherrytowers.shinyapps.io/bird_market/).

Another way to catch problems with your apps before uploading them to shinyapps.io is to set

options(warn=2)

on the command line in R, and then run your app locally. This makes anything that would normally be just a warning cause R to stop with an error instead, which can help you diagnose where problems are occurring using tools like traceback().
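
Here is a small base R sketch of the effect of that option, outside of shiny. The function parse_rate() is a hypothetical stand-in for any piece of app code that would normally only produce a warning.

```r
# A hypothetical helper that triggers a warning on non-numeric input
parse_rate <- function(txt) as.numeric(txt)

# By default this only warns and returns NA
# (the warning is suppressed here to keep the output clean)
quiet_result <- suppressWarnings(parse_rate("abc"))

# With warn = 2 the same warning is promoted to an error
old_opts <- options(warn = 2)
strict_result <- tryCatch(
  parse_rate("abc"),
  error = function(e) paste("stopped:", conditionMessage(e))
)
options(old_opts)  # restore the previous warning level

print(quiet_result)
print(strict_result)
```

In the default case the NA would silently propagate into later calculations or plots, whereas with warn=2 the code stops at the point where the problem first arises.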
 
Styling your apps

The basic shiny interface has a fair amount of flexibility in page layout, etc. If you want a greater array of fonts and colours, etc, you can use CSS style sheets, following the instructions here.
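
As a rough idea of what this can look like, here is a minimal sketch that puts a couple of CSS rules directly in the page header of the user interface. The title, font and colour choices are arbitrary, and this is only one of several ways shiny lets you attach CSS.

```r
library(shiny)

ui <- fluidPage(
  # Inline CSS placed in the page header; the rules below are illustrative only
  tags$head(
    tags$style(HTML("
      body { font-family: Georgia, serif; }
      h2   { color: #2c3e50; }
    "))
  ),
  # titlePanel() renders its text as an h2 heading, so the h2 rule applies to it
  titlePanel("Hypothetical styled app"),
  mainPanel(p("Body text picks up the font set in the CSS above."))
)
```

For anything more than a few rules, the same CSS could instead live in a separate file that is pulled in with includeCSS(), which keeps ui.R uncluttered.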