Elimination campaigns for neglected tropical diseases call for specialised surveillance methods. In any elimination campaign, the effort required to find residual infections increases dramatically as prevalence approaches zero. For neglected tropical diseases, the rising cost of surveillance can be difficult to sustain in resource-poor settings. This is exacerbated for many macroparasitic infections, such as lymphatic filariasis (LF) and onchocerciasis, because much of the infected population can remain asymptomatic for months or years. Molecular xenomonitoring (MX), the surveillance of disease vectors for evidence of infection using DNA-based PCR methods, can be a cost-effective and non-invasive surveillance technique for vector-borne diseases. Disease vectors (e.g. mosquitos, blackflies) can be trapped *en masse* and tested in batches or pools, keeping down the cost of surveillance. The results of these pooled tests can then be used not only to identify sites with active transmission, but also to quantify the intensity of transmission by estimating the prevalence of infection within the vector species.

Unfortunately, pooled testing complicates the analysis and interpretation of results. With pooled testing we do not obtain a positive or negative result for each individual vector — instead we obtain a positive or negative result for each pool of vectors. Pool-tested data therefore requires custom methods and software for even apparently simple tasks like estimating infection prevalence. In the early 2000s, a team led by Professor Charles Katholi (University of Alabama at Birmingham), supported by the Onchocerciasis Control Programme in West Africa, created the 'Poolscreen' software to address this gap, applying it to MX surveys of blackfly vectors of onchocerciasis. Poolscreen, a stand-alone program with an intuitive user interface, soon became the standard software for estimating prevalence for MX surveys around the world and was actively updated until 2010.
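To see why pooled results need special handling, consider the simplest case: pools of equal size. If `k` of `n` pools (each of `s` vectors) test positive, a pool is positive whenever at least one of its vectors is infected, so the probability a pool tests positive is 1 − (1 − p)^s, where p is the individual prevalence. Solving for p gives a closed-form maximum-likelihood estimate. This is a standard result for pooled (group) testing, not the exact Poolscreen implementation; a minimal Python sketch:

```python
def pooled_prevalence_mle(n_pools, n_positive, pool_size):
    """MLE of individual prevalence from equal-sized pooled tests.

    A pool tests positive if at least one of its `pool_size` vectors is
    infected, so P(pool positive) = 1 - (1 - p)**pool_size.  Setting this
    equal to the observed positive fraction and solving for p gives the MLE.
    """
    if n_positive == n_pools:
        return 1.0  # every pool positive: the MLE sits at the boundary
    positive_fraction = n_positive / n_pools
    return 1 - (1 - positive_fraction) ** (1 / pool_size)

# e.g. 100 pools of 25 mosquitos, 12 pools positive
print(pooled_prevalence_mle(100, 12, 25))
```

Note that this differs from the naive estimate of positives divided by total vectors (12 / 2500 here): a positive pool may contain more than one infected vector, and the MLE accounts for that.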

## Modernising molecular xenomonitoring software

At the end of 2019, Dr. Helen Mayfield approached me with a problem. She was working on a large COR-NTD funded project — Surveillance and Monitoring to Eliminate LF from Samoa (SaMELFS)* — that included surveys of humans and mosquitos. Helen's colleague, Lt. Col. Brady McPherson, was trying to analyse the results of the MX arm of SaMELFS. The large dataset included mosquitos from a variety of *Aedes* and *Culex* vector species, sampled from across several regions, dozens of villages, and many sampling sites. Poolscreen was working really well for estimating mosquito infection prevalence across the whole country, but the tool proved quite cumbersome when it came to estimating prevalence by region, village, household, or vector species. Helen thought that there must be a way of automating all this and asked me if I could use my mathematical modelling skills to help out.

I surprised Helen (and myself!) by pulling together a simple R program in an afternoon. The program could read in MX data, split it into groups based on any of the variables (e.g. by region), and estimate prevalence separately for each group. Using the new program, Brady's vector- and location-specific analyses only took a few minutes and a handful of lines of R code — one line to read in the data, one line each for estimating prevalence for each way of splitting up the data (e.g. splitting by region, or splitting by region and vector species), and a few lines for saving the outputs.
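The prototype itself was written in R, but the split-and-estimate computation it performed is easy to sketch. Assuming (hypothetically) one record per tested pool with equal pool sizes within each group, the grouping step might look like this in Python:

```python
from collections import defaultdict

# Hypothetical records: one row per tested pool (field names are illustrative).
pools = [
    {"region": "North", "species": "Aedes", "size": 25, "positive": True},
    {"region": "North", "species": "Aedes", "size": 25, "positive": False},
    {"region": "South", "species": "Culex", "size": 25, "positive": False},
    {"region": "South", "species": "Culex", "size": 25, "positive": True},
]

def prevalence_by(pools, *keys):
    """Group pool records by the given fields and return the pooled-test
    MLE of prevalence for each group (assumes equal pool sizes per group)."""
    groups = defaultdict(list)
    for rec in pools:
        groups[tuple(rec[k] for k in keys)].append(rec)
    estimates = {}
    for group, recs in groups.items():
        n = len(recs)                          # pools in this group
        k = sum(r["positive"] for r in recs)   # positive pools
        s = recs[0]["size"]                    # vectors per pool
        estimates[group] = 1.0 if k == n else 1 - (1 - k / n) ** (1 / s)
    return estimates

print(prevalence_by(pools, "region"))             # one estimate per region
print(prevalence_by(pools, "region", "species"))  # region x species
```

One call per way of splitting the data — which is essentially the "one line per grouping" workflow described above.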

With the growing use of MX, we thought we should make the code available to the community of researchers and public health officers doing MX around the world. It was at this point that we noticed three gaps that still needed addressing. Buoyed by the easy victory in developing the prototype program, we set about filling these gaps by turning the prototype code into a full R package. The package, called PoolTestR, fills two of these gaps already!

The first gap was that Poolscreen and our prototype code both assumed that the disease vectors were collected using simple random sampling. But in practice, MX surveys are much more like cluster-randomised surveys: traps are placed at randomly or systematically chosen sites and these traps attract the vector population in their immediate vicinity. Failing to account for hierarchical sampling designs will tend to underestimate sampling uncertainty and bias prevalence estimates. The creators of Poolscreen have noted the hierarchical sampling structure inherent to MX surveys and cautioned users accordingly, but their software doesn't provide a way to adjust for this hierarchy.
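A small simulation illustrates the problem. Suppose (hypothetically) that prevalence varies from trap site to trap site around a common mean; then the overall sample prevalence varies across surveys much more than a simple-random-sampling analysis would assume. The parameter values below are invented purely for illustration:

```python
import random
import statistics

random.seed(1)

MEAN_PREV, SITE_SD = 0.02, 0.015      # site prevalences vary around the mean
N_SITES, VECTORS_PER_SITE = 10, 200   # a clustered survey design

def one_survey():
    """Overall sample prevalence from one simulated clustered survey."""
    infected = 0
    for _ in range(N_SITES):
        # Each site has its own prevalence (truncated to [0, 1])
        p = min(max(random.gauss(MEAN_PREV, SITE_SD), 0.0), 1.0)
        infected += sum(random.random() < p for _ in range(VECTORS_PER_SITE))
    return infected / (N_SITES * VECTORS_PER_SITE)

estimates = [one_survey() for _ in range(1000)]
clustered_var = statistics.variance(estimates)
# Variance a simple-random-sample analysis would assume for the same total n:
srs_var = MEAN_PREV * (1 - MEAN_PREV) / (N_SITES * VECTORS_PER_SITE)
print(clustered_var / srs_var)  # "design effect" well above 1
```

Because the true survey-to-survey variance is several times the simple-random-sampling variance, confidence intervals computed under the simple-random-sampling assumption are too narrow.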

Second, while it is common to use disease surveys in humans to estimate odds ratios and identify risk factors for infection within the framework of logistic regression, this hasn't been commonly done for MX surveys — in part due to the complexity introduced by testing vectors in pools.

We realised that we could fill both gaps simultaneously by providing tools to do mixed-effect logistic regression modelling for pooled data. Logistic regression modelling would provide a framework for estimating familiar odds ratios and identifying risk factors for prevalence in vectors, while mixed-effect regression could be used to account for hierarchical sampling frames. Unfortunately, creating a complete R package with these extensions proved to be more complicated than developing the original prototype. It took the help of my colleague Dr. Ben O'Neill and many an afternoon squeezed between other projects, paternity leave, and a global pandemic before PoolTestR was published on CRAN, the official repository of open-source R packages. It's now straightforward to estimate infection prevalence in vectors even from large and complex MX surveys across many sampling sites and vector species.
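The modelling idea can be sketched briefly. If each vector in a pool is independently infected with prevalence p = logistic(η), where η is a linear predictor built from covariates (plus site-level random effects in the mixed-effect version), then a pool of s vectors tests negative with probability (1 − p)^s. A minimal fixed-effects-only Python sketch of that likelihood, with invented data — not the package's actual implementation, which is in R:

```python
import math

def logistic(eta):
    """Inverse-logit link: maps a linear predictor to a prevalence in (0, 1)."""
    return 1 / (1 + math.exp(-eta))

def pool_positive_prob(eta, pool_size):
    """P(pool tests positive) when each of `pool_size` vectors is
    independently infected with prevalence logistic(eta)."""
    return 1 - (1 - logistic(eta)) ** pool_size

def log_likelihood(beta, pools):
    """Log-likelihood of a fixed-effects pooled logistic regression.

    Each pool is (covariate tuple x, pool size s, positive result y).
    A mixed-effects version would add a site-level random intercept to eta.
    """
    ll = 0.0
    for x, s, y in pools:
        eta = sum(b * xi for b, xi in zip(beta, x))
        q = pool_positive_prob(eta, s)
        ll += math.log(q) if y else math.log(1 - q)
    return ll

# Hypothetical data: intercept + one binary covariate, pools of 20 vectors
pools = [((1, 0), 20, 0), ((1, 0), 20, 1), ((1, 1), 20, 1), ((1, 1), 20, 1)]
print(log_likelihood((-4.0, 1.0), pools))
```

Maximising this likelihood over β recovers the regression coefficients, and exp(β) gives the familiar odds ratios, just as in ordinary logistic regression on individually tested specimens.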

## Where next?

The third gap we identified is yet to be filled: while decades of research have been committed to the design of surveys with individually examined people or specimens, there is very little that addresses the concerns of surveys with pooled specimens or the particular concerns of MX surveys. What is the appropriate sample size if samples are tested in pools? How many sampling traps or sampling sites should be used? How should sampling sites be distributed spatially? How should vectors be divided into pools? Is it better to use a single pool size, or to split vectors across pools of widely different sizes? How can the results of previous MX surveys be used to inform the design of future surveys? How should we adapt MX surveys as we get closer and closer to zero prevalence? How should we adjust surveys to account for imperfect sensitivity and specificity of tests?

As these questions are tackled, we hope to incorporate MX survey design tools into PoolTestR and make them readily accessible to anyone working to eliminate vector-borne diseases. In the meantime, PoolTestR continues to be a living project, so we would love to know how you are using the package and hear your suggestions for improving the tools.

## About the Author

*Dr. Angus McLure is a postdoctoral fellow at the Research School of Population Health, Australian National University. His research interests are in the mathematics of infectious diseases, with an emphasis on zoonotic and vector-borne diseases.*

## About PoolTestR

*PoolTestR is a free and open-source package for R. To learn more about installing and using PoolTestR visit **github.com/AngusMcLure/PoolTestR#pooltestr**. To learn more about the methods and advanced uses of the package, read the article pre-print written by Angus McLure, Ben O'Neill, Helen Mayfield, Colleen Lau, and Brady McPherson available at **arxiv.org/abs/2012.05405**.*

* SaMELFS was a large project led by Prof Colleen Lau (Australian National University/University of Queensland) and Prof Patricia Graves (James Cook University), in collaboration with the Samoa Ministry of Health, Samoa Red Cross, and the Australian Defence Force Malaria and Infectious Diseases Institute.