There’s No Glass Ceiling if you Leave the Room

My friends have been asking me why I resigned from my current job, with no job lined up.

It started when ants found their way into our home. I could see a solid trail of ants determinedly racing towards a destination (turns out someone had spilled soda underneath the fridge).

The ants looked something like this:

Marching in a single file, the ants were focused and targeted on their sugar goldpile at the end of their trail. There were many of them, and they were undeterred. Like in a well-established field of research, everyone raced toward a final goal because they knew the reward was guaranteed.

After we cleaned up the soda, we saw a huge dropoff of ants in our home. However, one explorer ant (“scout ant” if you’re an entomologist) found a fragrant apple, and established a sparse trail leading a few other pioneering ants to our apple. This is akin to finding a promising, new research area. You’re not guaranteed success, but if you do, then many ants will eventually follow.

I have been lucky to be an explorer ant throughout most of my career. I established burgeoning fields and developed disruptive technologies.

The Job I Left Behind

When I started my job at GIS over 7 years ago, my job was funded by so-called “hard money” — guaranteed funding — which allowed me to be an explorer ant. I was given resources and allowed to try new ideas, so I executed some pretty interesting projects. Besides continuing research in genetic variation [1,2,3], I studied whether one could use online drug reviews to measure drug performances [4], created a framework for emancipating facts from papers locked behind paywalls (think SciHub with a legal theory) [5], and even looked at the patterns of Facebook memes [6].

However, the funding model at the institute changed; it started to transition to grant funding. The grant funding scheme was not guaranteed for a PI; funding instead was allocated top-down and focused on directed, big-money topics such as heart disease and cancer.

When I started writing grants, I found myself using buzzwords and adding unnecessary tasks to pad grants. Rather than designing the project the way it should be, I would bloat projects to consume the available funding.

The environment was changing, too. When resources are tight, people’s behaviors change. I looked around and saw certain behaviors emerging as people struggled to survive – this is not what I wanted to become.

Inevitably, I started to change: my science was changing and my efficiency was changing. And I didn’t like this change.

Large projects were winning funding, but I couldn’t be passionate about them because they were not risky or groundbreaking enough for me. Funding agencies were after success metrics like number of patents filed and licensees signed. They wanted a sure win, just like the ants rushing to our sticky soda sugar mine.

However, going into a heavily researched, well-established area? It’s just not me.

I could no longer be an explorer ant, and was worried I’d become:


I knew what I didn’t want to become, but what did I want to become?

I went through a lot of soul-searching. I applied for jobs, but nothing seemed to click.

A couple of incidents happened around this time.

  1. I heard a female entrepreneur speak, which is rare sight. She was realistic, open, and honest about the challenges in her life. What impressed me was that she had defined her own unique path.

    Instagram: lsjourney

  2. On a long flight, I watched the movie Moana on my dinky little airplane TV. It is about a girl finding her own identity (and saving her village in the process). These are my favorite lyrics in the whole movie:Like Moana, I have “journeyed farther” and have accomplished a lot in my career. My environment may change, but I should not let it change me. I still want to explore, to push boundaries, and to do ground-breaking things.
  3. My partner heard me singing “Moana” so often that he knew the lyrics without seeing the movie. Out of love, or perhaps to salvage his own ears, he suggested I take some time off. The way he put it was “if I didn’t make you happy, you’d leave me in a heartbeat. So if your job doesn’t make you happy, why not take a break instead of holding out for that new right job to come along?”

So I resigned.

The Future

(This is the shortest section because my story is still being written…)

So yes, I am unemployed! I am taking a step back to look at the big picture. I ain’t gonna lie: it’s scary like jumping off a cliff without a safety net.

Instagram adrenaline.addiction

Since my resignation, I’ve been rediscovering me. I have worn many hats: industry, academic, and clinical. I have always been defined by a role; now I will define my own role.

I have a general strategy. I’ve accepted a few part-time consulting jobs that I’m excited about, because it lets me work on exciting projects in new areas with amazing people. I am also very lucky to be a bioinformatician because all I need to continue creating is a laptop and access to cloud-based supercomputers.


Being a consultant is like being a tour guide –bringing others to newly discovered areas. I’ll also use some time to explore the unknown.

Most importantly, I’m keeping time for myself. There are some personal things on my bucket list I want to tick off. Some are genomic and disease-related; some are not. But in the end, I want to find something that challenges me and grows me.

Sure, like any explorer, I may fail.

But I’ll never know if I don’t try, right?


1. SIFT missense predictions for genomes. Nature Protocols 11:1-9. [siftdna.org]
2. Phen-Gen: Combining Phenotype and Genotype to Predict Causal Variants in Rare Disorders. Nature Methods 11:935 [phen-gen.org]
3. Predicting the effects of frameshifting indels. Genome Biology 13:R9
4. Assessment of Web-Based Consumer Reviews as a Resource for Drug Performance S. Adusumalli, H. Lee, Q. Hoi, S.L. Koo, I.B. Tan, P.C. Ng (2015) Journal of Medical Internet Research 17:e211
5. Enabling Public Access to Non-Open Biomedical Literature via Idea-Expression Dichotomy and Fact Extraction. Association for the Advancement of Artificial Intelligence Workshop on Scholarly Big Data [FactPub]
6. Information Evolution in Social Networks. The 9th ACM International Conference on Web Search and Data Mining


Reverse engineering contingency (2×2) table from Odds Ratio (OR)

Given the odds ratio (OR), we will calculate the individual cells in the contingency table (a,b,c,d).

In yellow, I’ve highlighted what is known.
a,b,c, and d are unknown and what we want to calculate.

Odds Ratio = (a/c) / (b/d)

Cases Controls Total
Exposed a b total_exposed
Unexposed c d total_unexposed
Total total_cases

If you’re getting the OR from a paper, the paper usually has total_exposed, total_unexposed, total_cases,and total_participants.

In that case, you can derive a, b, c, and d.

Solving for a:

Cases Controls Total
Exposed a total_exposed – a total_exposed (a+b)
Unexposed total_cases – a total_unexposed – total_cases + a total_unexposed (c+d)
Total total_cases (a+c) total_controls (b+d) total_participants

So now, the equation for OR can be written in terms of a and the known numbers :

OR = (a * d) / (b * c)
OR = (a * (total_unexposed – total_cases + a)) / ((total_exposed – a) * (total_cases – a))

If you have the values for OR, total_exposed, total_unexposed, total_cases, and total_controls, you can solve for a</i> using the quadratic formula.

Once you solve for a, solving for b, c, and d is trivial.

Try it out!

 Deriving cells of 2×2 Contingency Table from Odds Ratio:

Enter values in yellow cells


Absent Present Totals
  Group 1  
  Group 2  


I came across this problem when reading an Alzheimer’s paper.

Looking at ApoE ε4 carriers (n=452), smokers have an OR of 1.97 for dementia compared to non-smokers.

Because this was a population study, I wanted to know how many smokers got dementia, and how many non-smokers got dementia. If I got the individual cells, I could calculate this.

Out of the 452 ApoE ε4 carriers, 207 were smokers (45.8% of 452) and 31 had dementia (6.9% of 452).

From this,

  • OR = 1.97
  • total_exposed = 207
  • total_unexposed = 245
  • total cases (those with dementia) = 31
  • total controls (without dementia) = 421

I plugged in the above calculator to get:

Dementia Non-Dementia Total
Smoking 19 188 207
Non-smoking 12 233 245
Total 31 421 452

In this population-based study, 9% (19/207) of the smokers had dementia while 5% (12/245) of the nonsmokers had dementia.


Alzheimer’s book: 100 Simple Things You Can Do to Prevent Alzheimer’s and Age-Related Memory Loss

The book “100 Simple Things You Can Do to Prevent Alzheimer’s and Age-Related Memory Loss” was written in 2010.

You can find all of the advice on the Internet, and in more succinct terms.

The author Jean Carper lists “100 actions” you can take to prevent Alzheimer’s. Each action is written as a chapter. My biggest pet peeve was the repetition of the list. It’s like she thought the reader had memory loss and so repeats the same thing for 3 or 4 times.

For example, one recommendation is to ‘use your brain’. OK, got it. She repeats this theme at least 3 times as part of her list of 100; chapters “Build Cognitive Reserve”, “Get a Higher Education”, “Google Something” are all the same.

Another example: exercise to improve memory. This message to exercise is repeated in chapters “Be a Busy Body”, “Prevent and Control Diabetes”, “Enjoy Exercise”, “Avoid inactivity”, and “Watch your Waist.”

The irony is that this book is targeted for an ‘intelligent’ reader — it mentions a lot of scientific studies. I appreciate that, but if your reader can understand words like insulin and inflammation, then don’t you think the reader would notice filler pages?

There are not 100 things — maybe 20 things at best. It’s like the editor told the author — we can only sell a booklet of tips for 30 things for $5. But if you can make it a list of 100, then we can sell the book for $20. Boo! Hiss!

Jean Carper did mention some scientific studies that I’d like to follow up on, especially those specific to ApoE4.

Specific for ApoE4

  1. Alcohol is bad for Alzheimer’s

    8-14 beverages / week : 37% lower risk of dementia, but this does not apply to E4
    > 14 beverages / week : Doubles the odds for Alzheimer’s
    Adults who usually drink lightly or moderately, but go on occasional binges are 3x more likely to develop dementia
    Drinking with ApoE4 pushes Alzheimer’s 4-6 years earlier

  2. High homocysteine levels and ApoE4 is bad. Taking vitamin B can lower homocysteine levels
  3. Mice genetically destined to get Alzheimer’s (check if ApoE4) fed with nicotinamide (over-the-counter form of niacin) zipped through mazes and did not get Alzheimer’s. There are studies in progress to see if niacin-packed foods delay Alzheimer’s. Eat niacin-rich foods like tuna, salmon, turkey breast, sardines, peanuts, halibut, and chicken
  4. Don’t eat fast food (tested this in ApoE4 mice)
  5. ApoE4 carriers are especially susceptible to tiny blows to head
  6. Wear a nicotine patch for ApoE4 – cognitive boost greater in carriers of 2 copies hmm, don’t know how I feel about this

Treatments/Foods for Alzheimer’s in general

Ways to Measure

  1. Ankle-brachial index (ABI) test: leow ABI readings likely to get vascular dementia / Alzheimer’s
  2. Measure C-reactive protein. Keep it low (if you have too much inflammation, C-reactive protein levels become high
  3. Balance (how long can you stand on one foot)
  4. Systolic blood pressure > 140 mm in midlife, stronger predictor of demenita. Ideally,
    < 120 systolic, > 80 diastolic
  5. Keep homocysteine levels low or take vitamin B
  6. PET scans show deposits before any signs of mental impairment


Alzheimer’s Books: Tangles and The Little Girl in the Radiator

I read two books on Alzheimer’s from the caretaker’s perspective:

  1. The Little Girl in the Radiator
  2. Tangles

Tangles and The Little Girl in the Radiator are written from the perspective of the caregiver. Tangles is written by Sarah Leavitt about her mother. The main caretaker is her father.

In The Little Girl in the Radiator, the son (Martin Slevin) is the main caretaker.

I enjoyed Radiator Girl more because of its humor. Tangles is a comic book and I’m not a big fan of comic books. Both were good; tears were flowing at the end.

Here are some things that stood out for me:

Loss of Appetite
In both books, the patient loses appetite, and so is in real danger of getting weaker. This may be in part because the patient’s sense of smell is gone, and so flavor and taste comes from the tongue. As a result, both patients loved sweet food (sweet candies in Tangled and chocolate biscuits in the Little Girl in the Radiator). It’s bittersweet the lengths at which the patients would eat sweet food. In Tangled, the mother eats sweet candies without taking the wrapper off, while in the Little Girl in the Radiator, the mother buys 50 packets of chocolate biscuits.

Managing the bodily functions of an Alzheimer’s patient is hard. In Tangles, the mother’s hygiene has deteriorated to a point where there are bits of feces and the patient is oblivious. The author mentions multiple incidents, and I think the mother even pooed in her underwear. I’m not sure why they didn’t put the mother in diapers earlier.

There are some nice pictures of raised toilet seat which only a comic book could capture nicely.

I wonder if a bidet toilet would have helped the patient remain clean. I wonder if there are bidet toilets for elderly. You’d have to be used to using one too, and I know I’m not used to it.

Also in Tangles, keeping the mother clean is a challenge. The mother can’t brush her own teeth so her breath stinks. The comic book pictures a bathtub, which must be difficult to get in and out of. I wonder if a shower stall is easier for someone with Alzheimer’s. Hygiene seems a challenge and they hire two caretakers for the mother when the father is away at work or needs a break.

Pets offer love
In both books, pets seem to help. In Tangles, Sarah’s mother recognizes the cat instead of her own daughter, In The Little Girl in the Radiator, the patient adopts an unruly dog for a few months, and is the one person who can relate to the dog. There are some pretty touching stories about the dog stealing the Sunday roast, but at then finding the lost patient!

Lost concept of time / Waking up at night
In both books, the patient can’t sleep and wakes up in the middle of the night, expecting a hair appointment or having a full-on conversation. In Tangles, the husband turns on the TV all night to entertain his affected wife. This can be very hard on the caretaker because the caretaker needs rest too.

Interesting Observations by the Author of The Little Girl in the Radiator:
Suggestive and Easily Duped
This may depend on how social the patient is. In the Little Girl in the Radiator, the author describes how his mother is easily swindled by door-to-door salesmen. Thankfully, he had power of attorney and was able to dispute charges which the mother had agreed to.
Know the laws in your state and country and assert them! Get power of attorney so that any legal agreement that the patient enters into isn’t binding.
Another example of how the patient is impressionable, is that whatever she watches on television becomes reality for the patient.

Repetitive actions that speak to a larger insecurity
Martin Slevin was astute enough to interpret deeper meaning from his mother’s obsessions. The title “Little Girl in the Radiator” is a fixation that his mother has — that there’s a little girl trapped in the radiator. Martin learns from his mother in her last days that ‘she’ is the little girl in the radiator — that she feels trapped and can’t come out. There are other repetitive patterns such as asking to go to the hairdresser (the mother was always looking her best and well-coiffed) and locking her son out of the house when she felt insecure or that he was a threat.


ApoE4 and increased risk of Alzheimer’s

My significant other (SO) has a good chance of getting Alzheimer’s, based on his family history and genetics (he’s ApoE4 homozygous). We discovered this when he got his DNA tested.

So my goal (that sounds really ambitious – maybe I should say “aim”? Trying again…)

My aim is to find preventive measures for Alzheimer’s, specifically for ApoE4 carriers. My partner is 42 years old right now. Can we prevent Alzheimer’s or delay it? If we start early, maybe he can have another 5 years of sanity. I know taking care of him will be exhausting. I’m trying to spend some time now to see if I can delay/prevent his dementia.

Yes, drugs are in development. Even if a drug gets to the market by the time he needs it, it could have side effects, be too pricey, or may not work effectively on him.

So if there are natural, preventive measures, then why not give it a try? I’m going to read the research, and using my biology/genetics/disease background, see if it makes sense at a biology/molecular level. Maybe we can do an experiment!

Let’s start at the beginning — how did we find out that my SO is at risk for Alzheimer’s?

The regulations are always changing on whether a person has the right to know if they’re at risk for Alzheimer’s. In 2017, FDA permitted 23andMe to provide Alzheimer’s predictions[1 , 2]

My SO had been tested prior to this approval. We were able to still get the Alzheimer’s risk and others by

  1. Get the raw 23andMe data.
    (DNA never changes, the information is always there. Regulation was preventing customers from this information)
  2. Getting the interpretations from Promethease, a 3rd party service for $5
  3. APOE4 genotype from Promethease

Early 23andMe customers got Alzheimer’s predictions (2007-2013), but he ordered his test during the FDA ban period.

If you want to save money and not pay 23andMe’s $199 ancestry+health offering, you could pay for 23andMe’s $99 ancestry service, and then pay $5 to Promethease for the disease risk predictions. The diseases in the 23andMe report are well-studied, so no matter if you use Promethease or 23andMe, the results should be the same. (No guarantees as we haven’t paid for the $199 service, so I can’t check it.)

However, it’s not easy to look at Promethease’s report. The easiest way is to look at the interactive report (report_ui2.html) and see how many copies of APO-ε4 he had. ε4 is bad, while ε3 and ε2 are OK. Unfortunately, my SO had 2 copies of APO-ε4

Honestly, unless you have a strong background in genetics or Alzheimer’s, I’d pay the extra money to get the 23andMe report, because this is the type of stuff you don’t want to misinterpret. (I saw the early reports from 2007-2013, and they’re much clearer than Promethease.)

Still, looking specifically at Alzheimer’s and Promethease’s $5 report, I could tell that Promethease was looking at the right genetic mutations. Because he has two copies of the ‘bad’ allele, his risk for Alzheimer’s disease is increased 12x. He also has a family history of Alzheimer’s, so unfortunately, I think it’s only matter a time.

Here are the resources I’ve found so far:

  1. ALZFORUM has a lot of academic papers on ApoE4’s role in Alzheimer’s, which is great!
  2. Apoe4.info – People who have ApoE4 come to share on this forum. I especially like the I like “Our Stories” where people share what diet and exercise regiment they’re trying.

    Community Biomarker Archive is where people share their cholesterol, weight, and other body markers. I wish there were some test results that measure memory/cognitive decline, so people could know if what they’re doing/eating works.

Do you know any other resources out there for ApoE4?


FDA Validation of a PCR Test: Precision (Repeatability & Reproducibility)

Precision measures the consistancy of results. Precision is not accuracy. While accuracy is getting the answer right, precision is about getting the same answer over and over again. You don’t want to get an answer one day, and the next day, because it’s raining or the lab technician didn’t have their cup of coffee, it’s another answer!

A test needs to be accurate and precise. Accuracy was addressed in another post, this post will assess precision.

To assess precision, the lab runs the same sample under different conditions — different operators, machines, days, batches, etc. We’re trying to see that there are no significant differences, otherwise we need to address it. For example, if there’s a difference between machines, then this suggests the machines need to be re-calibrated.

ΔCt from two batches. Batch 1 and 2 look equivalent, as seen by the red boxplot (Batch 1) pairing nicely with a blue boxplot (Batch 2) for a given concentration.

x-axis labels describe the different concentrations below, at, or near the limit of detection (LoD).  FFPE indicates Formaldehyde Fixed-Paraffin Embedded samples.

ΔCt between the two instruments. Instrument 2 gives slightly higher values than Instrument 1, is this significant? The numbers will tell…

ΔCt between the two operators. The boxplots from the 2 different operators are similar  — nice job, guys!

ΔCt measures across the different runs. It’s a little noisy — ANOVA will tell us whether this is significant.

ΔCt measures across the different wells. For the most part, the boxplots at the different concentrations are grouped together.

While a picture is worth a thousand words, are these differences significant?  To explore this, we’ll use ANOVA or analysis of variance.

ANOVA stands for “analysis of variance” and is used to analyze the differences among groups. We expect the sample concentrations to contribute to differences in ΔCt. We hope that other factors like instrument and batch do not. If they do, something is wrong.

Here are some example ANOVA results.

 VariancePrecisionLower 95% CIUpper 95% CIp-value
Sample34.295.8611476.723 *10-141
Sample-Operator Interaction0.

It’s expected that sample concentrations contribute a lot to the variance (p =3*10^-141). The variance that the operators contribute is small (0.02) and not significant p=0.50. There is no interaction between the sample concentration and the operator.

 VariancePrecisionLower 95% CIUpper 95% CIp-value
Sample3.891.971.2554.071.2 * 10-43
Sample-Run Interaction0.

The variance that the run contributes is small (0.09) but significant p=0.001.

Here is the code that generates the statistics:

First, label the data by the different effects
In the data frame, we have a column labelled “Run”, and takes on values “Run1” and “Run2”. There is another column labelled “Well” and contains the well values.

effects <- c("Run", "Well", "Batch", "Operator")

Loop through the for loop to do ANOVA for each effect (variable).

for (effect in effects) {
 anova_results <- Anova (aov (deltaCq ~ factor (Sample) * factor ( eval (as.name (effect)) ),  
   data=cleaned_channels), type = "III", singular.ok=TRUE)
  anova_stats <- anova_stat_function (anova_results)


The anova_stat_function calculates the metrics seen in the tables:

anova_stat_function <- function (anova_results) {
   anova_df <- as.data.frame(matrix(nrow = nrow (anova_results) ))
   row.names (anova_df) <- row.names (anova_results)

   sum_of_squares_column = 1
   degrees_of_freedom_column = 2
   anova_df["Variance"] = anova_results[sum_of_squares_column]/anova_results[degrees_of_freedom_column]
   anova_df["Degrees of Freedom"] <- anova_results[degrees_of_freedom_column]
   anova_df["Precision"] = sqrt (anova_df["Variance"])
   # confidence interval is sum of squares divided by chi square
   chisq97.5 <- qchisq (0.975, anova_results[[degrees_of_freedom_column]])
   chisq2.5 <- qchisq (0.025, anova_results[[degrees_of_freedom_column]])
   anova_df["Lower 95% CI"] = anova_results[sum_of_squares_column]/chisq97.5
   anova_df["Upper 95% CI"] = anova_results[sum_of_squares_column]/chisq2.5

   return (anova_df)

Complete code can be found at Github under Rscripts_repeatability.txt and Rscripts_reproducibility.txt


FDA Validation of a PCR Test: Reportable Range

The Reportable Range plan validates the test under many different conditions. 3 RNA input levels (low, medium, high) and the target at 12 different concentrations, ranging from 0.0488% to 100%. By testing 36 different conditions (12*3), we’ll get:

  1. Summary statistics for each fusion concentration and RNA input level
  2. Amplification efficiencies at the different % fusion concentrations
  3. Relationships between:
    1. Texas Red Ct and RNA Input Level (should be positively correlated)
    2. FAM and Texas Red Ct Values (should be positively correlated)
    3. ∆Ct and Texas Red Ct Values (should not be correlated)
    4. FAM Ct and % Fusion Concentration (should be negatively correlated)
    5. TxRd Ct and % Fusion Concentration (should not be correlated)
    6. ∆Ct Value and % Fusion Concentration(should be negatively correlated)
  4. Multiple Linear Regression

Many relationships between x and y being tested; this is to make sure the assay is behaving as expected.

1. Summary statistics for each fusion concentration and RNA input level
For the summary statistics, we’ll just use the summary_function as described in a previous post. The summary_function prints out the common metrics like average, median, std dev, percentiles, etc.

2. Amplification efficiencies at the different % fusion concentrations

The amplification efficiency for the reaction is calculated using the formula:
Amplification Efficiency = (10 ^ (-1 / slope)) -1, (^ denotes “to the power of”).

This is our R function (because we are going to pass in different linear models based on RNA input or fusion concentration:

calc_amplification_efficiency <- function (lm) {
slope <- lm$coefficients[2]
amplification_efficiency <- (10 ^ (-1 / slope)) -1
return (amplification_efficiency)

all.lm <- lm (FAM.Cq ~ log10(Fusion_Concentration), data=channels)
print (calc_amplification_efficiency (all.lm))

Ideally your amplification efficiency is close to 1.

3. Relationships Between Two Variables
In this section, we fit a lot of linear regressions to check that the data behaves as expected. We generate a lot of plots, but it's really the slope of the linear regression that tells us whether the data behaves as expected.

Since we're looking at so many variables, a for-loop was necessary. This is a for-loop that looks at each concentration across the different measures (FAM, ΔCt, etc).

params_to_plot = c("deltaCq", "FAM.Cq", "Target_PTPRK_Ct")
conc <- c(100, 50, 25, 12.5, 6.25, 3.125, 1.563, .7810, .3906 , .1953, .0977, 0.0488 )

for (param in params) {
for (fus_conc in conc) {
subset_df = channels[channels$Fusion_Concentration == fus_conc,]
subset_df.lm <- lm ( eval (parse( text=param)) ~ TexasRed.Cq, data=subset_df)
confidence_interval <- predict (subset_df.lm, interval='confidence', level=0.95)
print (c("% fusion", fus_conc) )
print (coef (subset_df.lm))
print (confint(subset_df.lm, 'TexasRed.Cq', level=0.95))

Here are the linear regressions between FAM and Texas Red at various concentrations:

% ConcentrationInterceptSlope95% Confidence Interval around Slope
1001.490.94(0.91, 0.98)
501.530.99(0.95, 1.01)  
252.221(0.93, 1.06)
12.53.590.99(0.92, 1.04)
6.255.330.95(0.89, 1.01)
3.1254.041.06(0.97, 1.14)
1.56256.930.99(0.85, 1.12)
0.781253.961.17(0.96, 1.39)
0.39069.461.02(0.75, 1.30)
0.195318.670.72(0.16, 1.27)
0.097710.661.15(-0.52, 2.83)
0.048851.82-0.48(-25.19, 24.23)

The corresponding figure is below:

FAM Ct and Texas Red Ct Values

At higher concentrations, the slope is close to 1 so there is a strong correlation, but the slope deviates from 1 at lower concentrations.

Another for-loop in the code generates a whole lot of graphs...

Texas Red Ct and RNA Input Level

∆Ct and % Fusion Concentration

FAM Ct and % Fusion Concentration

Texas Red Ct and % Fusion Concentration

4. Multiple Linear Regression
This multiple linear regression has Texas Red and the fusion concentration as dependent variables.

mult.lin.regression <- lm (formula = deltaCq ~ TexasRed.Cq + log2(Fusion_Concentration) , data=channels)
summary (mult.lin.regression)
confint(mult.lin.regression, level=0.95)

∆Ct should not depend on the RNA input level (Texas Red Cq), so the coefficient should be close to or approximately 0, indicating that there was little relationship between ∆Ct and RNA input level. If the coefficient for log2 (% fusion) is large coefficient, this means % fusion concentration can predict ΔCt.


FDA Validation of a PCR Test: Run Control Specification (Part 5)

The purpose of run control specification is to find what is the acceptable range of values to accept a sample. For example, if a Texas Red value is detected in cycle 38 (Ct 38), this is too late in the PCR cycles and the signal could be due to contamination. Thus the sample is thrown out.

To get the acceptable ranges, the PCR test is run on clinical positive controls (FFPE and fresh frozen). Because Texas Red serves as an internal control, we look at its values to decide whether we accept/reject a sample.

The objective of the Run Control Specification is to find a range of Texas Red values that will deem a sample to be acceptable.

The code can be found on github.

We want to find the acceptable Texas Red range for the clinical samples. We analyze the positive controls and get prediction intervals based on the observed values.

The results of the Texas Red channel at 0.9, 0.95 and 0.99 confidence and 0.9, 0.95 and 0.99 coverage looks like:

 FluorConfidenceCoverage Level2-sided Lower Prediction Interval2-sided Upper Prediction Interval
Positive ControlTexas Red0.990.929.9631.38
Positive ControlTexas Red0.990.9529.9631.38
Positive ControlTexas Red0.990.9929.9631.38
Positive ControlTexas Red0.950.929.9631.38
Positive ControlTexas Red0.950.9529.9631.38
Positive ControlTexas Red0.950.9929.9631.38
Positive ControlTexas Red0.90.930.0431.35
Positive ControlTexas Red0.90.9529.9631.38
Positive ControlTexas Red0.90.9929.9631.38

From these results we choose the broadest range (29.96-31.38) because we want to be liberal accepting samples. For this range, any confidence and coverage level will do except 0.9 confidence 0.9 coverage.

How to get the Prediction Intervals

We make a data frame of the positive controls only (not the negative controls):

positive_control = df[df$Content == 'Pos Ctrl-1' | df$Content == 'Pos Ctrl-2' | df$Content == 'Pos Ctrl-3' | df$Content == 'Pos Ctrl-4', ]

And then a data frame for each channel Texas Red, Cy5, and FAM:
pc_TexRed = positive_control[positive_control$Fluor == 'Texas Red',]
pc_Cy5 = positive_control[positive_control$Fluor == 'Cy5', ]
pc_Fam = positive_control[positive_control$Fluor == 'FAM', ]

In the next section, I’ll show the analysis on Texas Red but this can be applied to all channels.

  1. Outlier removal by the Tukey rules on quartiles +/- 1.5 IQR
  2. pc_TexRed_noOutliers <- outlierKD (pc_TexRed, Cq)

  3. Test if the data looks normal or not:
  4. shapiro.test (pc_TexRed_noOutliers$Cq)
    The results give us:

    Results of Hypothesis Test

    Alternative Hypothesis:

    Test Name: Shapiro-Wilk normality test

    Data: pc_TexRed_noOutliers$Cq

    Test Statistic: W = 0.9494407

    P-value: 0.001008959

    which means the data is not normal.

  5. The tolerance package calculates prediction intervals.

    Load the library:

    library (tolerance)

    If the data looks normal, use the normtol function

    normtol.int (cleaned_df$Cq, alpha=alpha_num, P= coverage_level, side=2)

    If the data does not look normal, use a nonparametric tolerance function

    nptol.int (cleaned_df$Cq, alpha=alpha_num, P= coverage_level, side=2)

    Since the data is not normal, we'll use nptol.int. This loops through multiple confidence and coverage levels.

    confidence_levels <- c(0.99, 0.95, 0.90)
    nonparametric_coverage_levels <- c (0.90, 0.95, 0.99)

    for (confidence_level in confidence_levels) {
    alpha_num = 1 - confidence_level
    for (coverage_level in nonparametric_coverage_levels) {
    nonparametric <- ddply (all_controls_no_outliers, c('Fluor'), .fun = nptol_function, alpha=alpha_num, coverage_level=coverage_level )

    if (all (is.na (current_nonparametric))) {
    current_nonparametric <- nonparametric
    } else {
    current_nonparametric <- merge (current_nonparametric, nonparametric, all=TRUE)
    } # end coverage_level
    } # end confidence_level

This code generates the results table at the top if this post.


FDA Validation of a PCR Test: Limit of Detection (LoD) (Part 7)

Limit of detection is the sensitivity of the assay — how low a concentration can the test detect?

To test this, the lab did a dilution series over a range of concentrations. When the concentration is low enough, the fusion won’t be detected.

Using the data from the dilution series, we used 4 models to predict the limit of detection.

  1. Linear regression using all data
  2. Linear regression using just the concentrations where data was observed
  3. Logit
  4. Probit

1. Linear regression using all data

My data frame “chan” contains deltaCq values and the Percentage_Fusion.

First do the linear regression:

chan.lm <- lm ( deltaCq ~ log2 ( Final_Percentage_Fusion ), data=chan) summary (chan.lm)

with a prediction interval:

PI <- predict (chan.lm, newdata = new.dat, interval='prediction', level=0.90)

The fusion dilution where the top prediction limit crosses the delta Ct cut-off will be estimated and chosen to be the model's estimate of the C95.

My delta Ct cutoff is 5 (determined from the Accuracy study).

cutoff <- 5

Our linear model on the upper prediction interval is :

upper.lm <- lm (PI[,"upr"] ~ log2(conc) )

y = slope * x + intercept and hence x = (y- intercept)/slope. In our case x is the concentration, and it's what we want. y is delta Ct and we want to know what concentration is predicted to have a delta Ct of 5 (our cutoff).

slope <- upper.lm$coefficients[2]
intercept <- upper.lm$coefficients[1]

C95 = (cutoff - intercept)/slope

conc_LoD = 2^C95

conc_LoD is our limit of detection.

2. Linear regression using just the concentrations where data was observed

To do a linear regression on just some of the data, identify which concentrations to remove.

remove_because_partial_or_no_data <- c(0.0488, 0.0977, 0.1953, 0.3906, 0.781, 1.563)

Remove these concentrations from the data frame:

chan <- chan[!(chan$Final_Percentage_Fusion %in% remove_because_partial_or_no_data),]

and then follow the same steps as in #1.

3. Logit model

First, make a column of whether the PCR product is called as a positive or not. This column is called "CalledPositive".

chan["CalledPositive"] <- 1

If no signal is detected or deltaCq is > 5, then the fusion is not called and we have to assign CalledPositive as false or "0".

chan[is.na(chan$deltaCq), "CalledPositive"] <- 0

chan[!is.na(chan$deltaCq) & chan$deltaCq > 5, "CalledPositive"] <- 0

Run the logit analysis

mylogit <- glm (CalledPositive ~ log2Conc , family = binomial(link = "logit"), data = chan)

To find the sensitivity at 95% based on the logit model, I manually feed in ranges of values until I get something close to 95%.

For example, by inspecting the graph I know that my LoD is somewhere between 0 and 2 so I set

xlower <- 0

xupper <- 2

and then I make a series of points:

temp.data <- data.frame(log2Conc = seq(from = xlower, to = xupper, length.out = 10))

And I input temp.data to get predicted sensitivities:

predicted.data.logit <- predict(mylogit, temp.data, type = "response", se.fit=TRUE)

I then look at predicted.data.logit to see which concentration gives me close to 95%. This can take several iterations until I'm very close to 0.95

4. Probit model

We make use of the "CalledPositive" column defined from the section 3 logit model.

myprobit <- glm(CalledPositive ~ log2Conc , family = binomial(link = "probit"), data = chan)

To find the concentration with sensitivity of 95%, do the same steps as described in 3.Logit model, but use predicted.data.probit instead.

predicted.data.probit <- predict(myprobit, temp.data, type = "response", se.fit=TRUE)

You can find my complete linear regression R scripts and probit and logit scripts on github.


FDA Validation of a PCR test: Accuracy (Part 4)

In the Accuracy study, we will:

  • Set the acceptable working range
  • Set the cutoffs for calling a sample positive or negative
  • Assess how well our assay agrees with an alternative assay

Clinical samples can  be FFPE or fresh frozen. FFPE is usually poorer quality than fresh frozen, so the two sample types are treated separately.

Section 1. Setting the Acceptable Working Range

The “acceptable working range” is based on the observed values from clinical samples. Once the acceptable working range is set, samples that have values outside the working range will be rejected.

To calculate the acceptable working range, we look at the observed values of Texas Red for the wild-type samples and calculate prediction intervals. In R, alpha = (1 – confidence_interval) and P is coverage.

To calculate prediction intervals, we can use either normtol.int or nptol.int:

normtol.int (df$Cq, alpha=alpha_num, P= coverage_level, side=2) # for normally distributed data

nptol.int (df$Cq, alpha=alpha_num, P= coverage_level, side=2) #for data that's not normally distributed

The Shapiro test for normality:

shapiro.test (df$Cq)

showed that the Ct values were non-normal, so the non-parametric nptol.int was used to obtain the table below:

SampleAlphaConfidenceCoverage2-sided Lower2-sided Upper

Based on the table above, we decide on the acceptable working range. We usually go for the broadest range of values because we don’t want to reject too many samples.

Section 2. Set the cutoffs for calling a sample positive or negative

A sample will be called positive or negative based on its ΔCt value. Based on the plot of the ΔCt values with both positive and negative samples, we pick a cutoff that separates the negative samples from the positive samples.

Based on the figure above, we chose a ΔCt cutoff of 5.  Samples with ΔCt <= 5 will be mutation-positive and samples with ΔCt > 5 will be mutation-negative.


Section 3. Setting the Acceptable Working Range

This is fairly easy and doesn’t need complicated statistics.

Original MethodTotal
New MethodPositiveaba+b

Older posts «

» Newer posts