# Element

### What does Element do?

Element recognizes overrepresented motifs in one or more sets of user-provided promoters. In the online implementation of Element, each set must consist of promoters from a different genome. In practice, however, Element can analyze any sets of promoters originating from distinct backgrounds.

The background statistics reflect how many times a given motif is expected to occur in a randomly selected promoter from the set. These probabilities are then used in a large number of hypothesis tests, each asking whether the observed frequencies differ significantly from what would be expected for a random set of promoters of the same size. Element then uses well-established alignment algorithms to aggregate the significant motifs and presents the results as a set of clusters, each associated with any known biological function.

### How is the "Hit P-value" calculated?

The hit p-value indicates whether a motif is overrepresented by appearing in more promoters than would be expected for a random sample. It is based on the hit count statistic, defined as the number of promoters that contain one or more copies of the motif; that is, a promoter is considered a "hit" if it contains the motif. This statistic is a binomial random variable whose distribution is determined by the background hit probability. The p-value is calculated from this binomial distribution using the hit count observed in the queried promoters. When multiple genomes are analyzed together, the hit count statistic is a multivariate binomial random variable, and the p-value is accordingly calculated using the exact multivariate binomial distribution.
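As an illustration of the single-genome case, the sketch below computes an upper-tail binomial probability. It assumes the p-value is P(X >= observed hits) for X ~ Binomial(n, p); the function name `hit_p_value` and its parameters are hypothetical and not part of Element.

```python
from math import comb


def hit_p_value(hits: int, n_promoters: int, p_hit: float) -> float:
    """Upper-tail probability P(X >= hits) for X ~ Binomial(n_promoters, p_hit).

    Assumption: p_hit is the background probability that a randomly
    selected promoter contains the motif at least once.
    """
    if not 0 <= hits <= n_promoters:
        raise ValueError("hits must lie between 0 and n_promoters")
    if not 0.0 <= p_hit <= 1.0:
        raise ValueError("p_hit must be a probability")
    # Sum the binomial probability mass from the observed count upward.
    return sum(
        comb(n_promoters, k) * p_hit**k * (1.0 - p_hit) ** (n_promoters - k)
        for k in range(hits, n_promoters + 1)
    )


# Example: 30 of 200 query promoters contain a motif whose
# background hit probability is 0.08 (made-up numbers).
print(hit_p_value(30, 200, 0.08))
```

The direct sum is fine for a few hundred promoters. For much larger sets the integer binomial coefficient becomes too large to convert to a float, and a library routine such as `scipy.stats.binom.sf(hits - 1, n, p)` would be a more robust choice.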

### How is the "Mean P-value" calculated?

The mean p-value indicates whether a motif is overrepresented by appearing more times in the average promoter than would be expected from a random sample. It is based on the mean count statistic, defined as the average number of times a motif occurs in a promoter. The number of motif occurrences in any given promoter can be modeled as a Poisson random variable, and the sum of these variables over all promoters is again a Poisson random variable, so the mean count (this sum divided by the number of promoters) is well approximated by a Poisson model. The p-value is calculated from this Poisson distribution using the mean observed in the query set.
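A minimal sketch of this calculation follows. It assumes the test is an upper-tail test on the total count, which is Poisson with rate `n_promoters * background_mean` if each promoter's count is Poisson with rate `background_mean`. The function names and parameters are hypothetical.

```python
from math import exp, lgamma, log


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(S >= k) for S ~ Poisson(lam), summed directly over the upper tail."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    total = 0.0
    j = k
    while True:
        # Poisson probability mass at j, computed in log space for stability.
        term = exp(j * log(lam) - lam - lgamma(j + 1))
        total += term
        # Past the mode the terms shrink monotonically; stop once negligible.
        if j > lam and term <= 1e-16 * total:
            break
        j += 1
    return min(1.0, total)


def mean_p_value(total_count: int, n_promoters: int, background_mean: float) -> float:
    """P-value for the observed mean count, expressed via the total count.

    Assumption: the observed mean is total_count / n_promoters, and the
    total over all promoters is Poisson(n_promoters * background_mean).
    """
    return poisson_upper_tail(total_count, n_promoters * background_mean)


# Example: 60 occurrences across 200 promoters, background mean 0.2 per promoter.
print(mean_p_value(60, 200, 0.2))
```

Summing the upper tail directly, rather than computing one minus the lower tail, keeps very small p-values from being lost to rounding, which matters when tens of thousands of tests are corrected together.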

### Why is Element not detecting any significant motifs in my query?

A common problem is that users submit far too few promoters for analysis. Element does not use a pre-established database of TF binding sites to identify potentially overrepresented motifs, so it requires a large input set for any particular motif to reach statistical significance. There is no magic number beyond which you will start to get rich results, but if the regulatory mechanisms of your promoters are truly correlated, an input of more than 200 promoters should be expected to give at least some results. The most effective way to improve your results is to curate your input set carefully, so that as many promoters as possible are expected to share regulatory mechanisms.

### How can I download the unfiltered results for all of the words?

This can be done by using the fixed cutoff filter and choosing a cutoff of 1, meaning that nothing will be filtered out. You will then be able to download the results using the link on the job page.

### What do the different filters do?

Since such a large number of hypothesis tests are made, natural statistical variation alone will produce a correspondingly large number of false positives. The standard one-tailed cutoff of 0.05, which is effective for individual hypothesis tests, is insufficient to limit the number of false positives. Instead, a selection of standard multiple testing filters is provided:

• Benjamini-Hochberg (default)

The Benjamini-Hochberg filter is a widely used filtering process that controls the false discovery rate (FDR), which is defined as FDR = E[(false rejections)/(total rejections)]. When applied, it ensures that the FDR is less than or equal to the selected cutoff (this is typically the same cutoff that would be used for a single hypothesis test). It therefore allows some false positives, but ensures that, on average, most of the motifs reported as significant are truly overrepresented. For a cutoff of 0.05, no more than 5% of the reported results are expected to be false positives on average.

• Bonferroni Correction

The Bonferroni correction, or Bonferroni method, is a simple filtering process that controls the familywise error rate (FWER), which is the probability that at least one significant result is a false positive. The familywise error rate will be less than or equal to the cutoff, which for the standard cutoff of 0.05 means that there is at least a 95% chance that there are no false positives. It does this by comparing each p-value to cutoff / (total tests). Hence, it is extremely conservative and consequently underpowered.

• Bonferroni-Holm

The Bonferroni-Holm method is an improvement upon the Bonferroni correction. It likewise controls the familywise error rate, but it is more powerful because it provides more opportunities for the null hypothesis to be rejected. Rather than comparing every p-value to cutoff / (total tests), the Bonferroni-Holm filter steps through the p-values in ascending order, comparing the smallest to cutoff / (total tests), the second smallest to cutoff / (total tests - 1), and so forth, stopping at the first hypothesis that is not rejected.

• Fixed Cutoff

The fixed cutoff option truncates the results by removing any whose p-values exceed the fixed cutoff value. Considering the number of hypothesis tests performed, the cutoff has no statistical meaning, but it allows the user to truncate the results and potentially investigate motifs that are not significant under any of the other methods. Note that unless the cutoff is small, many of the 43848 words will be retained in the results, making the result pages slow to load.
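The three multiple testing filters can be sketched in their standard textbook forms as follows. This is illustrative only and not Element's actual code: the function names are hypothetical, and each returns a list of booleans marking which p-values pass. Holm is written here in its usual step-down form, with thresholds cutoff / m, cutoff / (m - 1), and so on for m tests.

```python
def bonferroni(p_values, cutoff=0.05):
    """Keep p-values at or below cutoff / m, where m is the number of tests."""
    m = len(p_values)
    return [p <= cutoff / m for p in p_values]


def holm(p_values, cutoff=0.05):
    """Step-down Holm procedure controlling the familywise error rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    keep = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] > cutoff / (m - rank):
            break  # stop at the first hypothesis that is not rejected
        keep[i] = True
    return keep


def benjamini_hochberg(p_values, cutoff=0.05):
    """Step-up Benjamini-Hochberg procedure controlling the FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= cutoff * k / m.
    last = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= cutoff * rank / m:
            last = rank
    keep = [False] * m
    for i in order[:last]:  # everything up to that rank is kept
        keep[i] = True
    return keep


p = [0.5, 0.001, 0.039, 0.012, 0.02]  # made-up p-values
print(bonferroni(p))          # most conservative
print(holm(p))                # keeps at least as many as Bonferroni
print(benjamini_hochberg(p))  # typically keeps the most
```

On any input, Holm keeps every result that Bonferroni keeps, and Benjamini-Hochberg keeps every result that Holm keeps, which mirrors the ordering of the filters from most to least conservative.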

### How are the background statistics calculated?

For a given organism, the background statistics are calculated from the frequencies of occurrence of all 43848 possible 3-8mer words across all promoter sequences. The promoter sequences are taken to be the 500-200bp upstream sequence starting at the beginning of every locus, as determined by that organism's canonical reference assembly.
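The figure of 43848 is consistent with counting each DNA word of length 3 to 8 once per reverse-complement pair, that is, treating a word and its reverse complement as the same motif. The text above does not state this explicitly, so the sketch below should be read as an assumption about how the count arises; the function name is hypothetical.

```python
def count_words_up_to_reverse_complement(k_min: int = 3, k_max: int = 8) -> int:
    """Count DNA words of length k_min..k_max, merging reverse-complement pairs."""
    total = 0
    for k in range(k_min, k_max + 1):
        all_words = 4**k
        # A word can equal its own reverse complement only when k is even;
        # such a word is fully determined by its first k/2 bases.
        self_complementary = 4 ** (k // 2) if k % 2 == 0 else 0
        # Self-complementary words form classes of size 1, all others pair up.
        total += (all_words + self_complementary) // 2
    return total


print(count_words_up_to_reverse_complement())  # 43848
```

Without merging reverse complements, the number of 3-8mers would be 87360, so the merged count is what matches the number quoted above.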