Citing
If you use RuleGO, please include the following publication in your references: Gruca A., Sikora M., Polanski A.: RuleGO: a logical rules-based tool for description of gene groups by means of Gene Ontology. Nucleic Acids Research 2011; doi: 10.1093/nar/gkr507. [PubMed]
Documentation

RuleGO is a web-based application for describing gene groups using decision rules based on Gene Ontology terms. It takes two lists of genes as input: a list of genes to be described and a reference list of genes. As a result, one obtains a list of rules that describe the input list of genes by means of conjunctions of Gene Ontology terms. The rules have the following meaning:

if a gene is described by a conjunction of gene ontology terms appearing in a rule premise, then it belongs to an analyzed group of genes

The rules have a statistical significance level determined by the user and are sorted according to the ranking obtained by a rule quality measure. The obtained rules also take into account the co-occurrence of terms in a given gene group, and the presented method guarantees that the co-occurrence will not be trivial (for example, resulting from the hierarchy of the ontology graph).
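
The meaning of a rule can be illustrated with a short sketch. This is not RuleGO code: the `Rule` class, the `matches` helper and the gene-to-term assignments are hypothetical and serve only to show how a conjunction of GO terms in a premise selects genes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    # GO term identifiers joined by logical AND (the rule premise)
    premise: frozenset


def matches(rule, gene_terms):
    """Return True if the gene is annotated with every GO term in the premise."""
    return rule.premise <= set(gene_terms)


# Hypothetical annotations, invented for illustration only
annotations = {
    "geneA": {"GO:0000280", "GO:0000278", "GO:0006464"},
    "geneB": {"GO:0000280"},
}

rule = Rule(frozenset({"GO:0000280", "GO:0000278"}))
recognized = sorted(g for g, terms in annotations.items() if matches(rule, terms))
print(recognized)
```

A gene annotated with only some of the premise terms, such as `geneB` here, is not recognized by the rule.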

The rule generation process consists of the following steps:

Each of these steps can be controlled by the user by setting appropriate parameter values. Click on the chart below to see the diagram describing the whole process and the names of the attributes that are involved in each of the above steps.

Primary set

It is a list of gene symbols for which rules are to be generated. Gene symbols in the list should be separated by a comma or a colon, or each gene symbol should be placed on a separate line. A list of allowed gene symbols is available here.
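
As an illustration of the accepted input format, the following sketch splits a pasted list on commas, colons and line breaks. The function name `parse_gene_list` is hypothetical, and removing duplicates while keeping the original order is an assumption, not documented RuleGO behavior.

```python
import re


def parse_gene_list(text):
    """Split on commas, colons or line breaks; drop blanks and repeated symbols."""
    symbols = [s.strip() for s in re.split(r"[,:\r\n]+", text)]
    seen = set()
    result = []
    for symbol in symbols:
        # Assumption: duplicates are dropped and the first occurrence is kept
        if symbol and symbol not in seen:
            seen.add(symbol)
            result.append(symbol)
    return result


pasted = "S000004200, S000002314:S000002525\nS000001505\n"
genes = parse_gene_list(pasted)
print(genes)
```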

Reference set

A list of gene symbols used as a reference set. A user can paste their own list of genes or choose the Rest of genome option, which uses the rest of the genome of the selected organism as the reference set.

To evaluate the statistical significance of the created rules, a hypergeometric test is used. Assuming that nφ is the number of genes that are described by the gene ontology terms appearing in a rule premise and nψ is the number of genes belonging to the primary set, we can create the following contingency table:

                          ψ (in primary set)    ¬ψ (not in primary set)
  φ (premise holds)       nφψ                   nφ¬ψ
  ¬φ (premise fails)      n¬φψ                  n¬φ¬ψ

where:

  • nφψ - number of genes that are described by gene ontology terms appearing in a rule premise and belong to the primary set
  • nφ¬ψ - number of genes that are described by gene ontology terms appearing in a rule premise and DO NOT belong to the primary set
  • n¬φψ - number of genes that are NOT described by gene ontology terms appearing in a rule premise and belong to the primary set
  • n¬φ¬ψ - number of genes that are NOT described by gene ontology terms appearing in a rule premise and DO NOT belong to the primary set
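
The four counts can be derived from plain sets of gene symbols. The sketch below is only an illustration of the definitions above; the function name, argument names and example genes are hypothetical.

```python
def contingency_counts(described, primary, universe):
    """Return (n_phi_psi, n_phi_notpsi, n_notphi_psi, n_notphi_notpsi).

    described: genes annotated with all GO terms of the rule premise
    primary:   genes of the primary set
    universe:  all genes of the primary and reference sets together
    """
    universe = set(universe)
    described = set(described) & universe
    primary = set(primary) & universe
    n_phi_psi = len(described & primary)
    n_phi_notpsi = len(described - primary)
    n_notphi_psi = len(primary - described)
    n_notphi_notpsi = len(universe - described - primary)
    return n_phi_psi, n_phi_notpsi, n_notphi_psi, n_notphi_notpsi


# Hypothetical example data
universe = {"g1", "g2", "g3", "g4", "g5", "g6"}
counts = contingency_counts({"g1", "g2", "g3"}, {"g1", "g2", "g4"}, universe)
print(counts)
```

The four counts always sum to the size of the universe, which is a convenient sanity check.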

Using the information from the above contingency table, we calculate the p-value of the hypergeometric test using the following formula:

In the current version of the application only a hypergeometric test for over-representation of gene ontology terms in the primary set is available. A one-sided (right-sided) test is used because we assume that for descriptive purposes we are interested only in conjunctions of gene ontology terms that occur with a higher frequency than would result from a random assignment of gene ontology terms to the genes composing the group. For each rule a corrected p-value is also provided, based on the Benjamini and Hochberg method for false discovery rate computation.
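
A minimal sketch of both computations follows. It assumes the standard right-tail form of the hypergeometric test and the standard Benjamini and Hochberg step-up adjustment; it is not taken from the RuleGO implementation. The argument `n_total` (the number of genes in the primary and reference sets together) and both function names are introduced here for illustration.

```python
from math import comb


def hypergeom_right_tail(n_phi_psi, n_phi, n_psi, n_total):
    """P(X >= n_phi_psi) when n_phi genes are drawn from n_total genes,
    of which n_psi belong to the primary set (assumed standard form)."""
    upper = min(n_phi, n_psi)
    numerator = sum(
        comb(n_psi, k) * comb(n_total - n_psi, n_phi - k)
        for k in range(n_phi_psi, upper + 1)
    )
    return numerator / comb(n_total, n_phi)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy numbers: 4 of 4 described genes fall into a primary set of 11 out of 20 genes
p = hypergeom_right_tail(n_phi_psi=4, n_phi=4, n_psi=11, n_total=20)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p, adjusted)
```

Only the upper tail is summed, which matches the one-sided test for over-representation described above.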

The Gene Ontology Consortium provides a structured and controlled vocabulary that is used to describe genes and their products independently of the species. The GO database is organized into three disjoint directed acyclic graphs (DAGs) describing biological process (BP), molecular function (MF) and cellular component (CC). Each node of a graph is called a GO term, and it is a single unit that describes some known biological process or function of a gene. The relationships between GO terms are hierarchical: as the DAG is traversed from the root to its leaves, the terms go from general to more specific concepts.

The Gene Ontology section allows selecting which ontology should be used to annotate the analyzed group of genes. You can select any combination of the biological process, molecular function and cellular component aspects.

Hierarchical annotations

The hierarchical structure of the ontology database allows representing biological knowledge on multiple levels of detail. Terms at the higher levels (closer to the root) describe more general functions or processes, while terms at the lower levels are more specific. To preserve the clarity of the ontology, the annotation files that are available on the Gene Ontology Consortium website include only "original" annotations, that is, annotations that were assigned to particular GO terms by curators. The annotations resulting from the "true path rule" (annotation of a gene to all parent nodes of the term it was originally assigned to) are not included in the annotation files. If the hierarchical annotations option is checked, the annotations resulting from the true path rule are included in the analysis.
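
The effect of the true path rule can be sketched as follows. The term labels and the `parents` mapping are invented for illustration and do not come from the real GO graph; the function names are hypothetical as well.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links (term excluded)."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def apply_true_path_rule(direct_annotations, parents):
    """Extend each gene's original annotations with all ancestor terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


# Hypothetical toy DAG: child -> list of parents
parents = {"leaf": ["mid1", "mid2"], "mid1": ["root"], "mid2": ["root"]}
full = apply_true_path_rule({"geneA": {"leaf"}, "geneB": {"mid1"}}, parents)
print(full)
```

Because a term in a DAG can have several parents, the traversal follows every parent link rather than a single path to the root.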

Non IEA Terms

This option allows excluding IEA (Inferred from Electronic Annotation) annotations from the analysis.
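
In terms of annotation records, this amounts to a filter on the evidence code. The record layout (gene, GO term, evidence code) and the gene-to-term pairings below are assumptions made for this sketch.

```python
# Hypothetical annotation records: (gene symbol, GO term, evidence code)
records = [
    ("geneA", "GO:0000280", "IDA"),
    ("geneA", "GO:0000278", "IEA"),
    ("geneB", "GO:0022402", "IEA"),
]


def drop_iea(annotation_records):
    """Keep only annotations whose evidence code is not IEA."""
    return [r for r in annotation_records if r[2] != "IEA"]


curated = drop_iea(records)
print(curated)
```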

Ontology level

This option is used to set the minimal and maximal level of the GO terms that are used for description.

The GO terms which should be included into the rules

This is a list of important GO terms which can be used to build the rules. The provided terms are used as "seed terms", and each rule will include at least one GO term from the provided list.
During the filtration process, rules that include longer combinations of GO terms from the provided list are ranked higher, and the RuleGO method tries not to remove them from the output set of rules.
This option can be particularly useful when analyzing a set of genes from experiments related to particular biological processes or functions. For example, when analyzing a set of genes from tumor samples, one may be interested in annotations related to the so-called hallmarks of cancer. To facilitate the analysis, several predefined lists of GO terms are also provided. These terms are related to some important mechanisms that are altered in cells which develop cancer. If the option Use only above terms to create rules is selected, the RuleGO algorithm will generate rules using only these terms. In this case we suggest providing a longer list of GO terms, because with a short list it is very likely that no statistically significant combinations of GO terms will be generated.

The GO terms which should NOT be included into the rules

This option allows the user to provide a list of GO terms which should be excluded from analysis.

This section allows setting the options of the rule generation algorithm.

Minimal number of genes described by GO term

This option eliminates from the analysis the GO terms that describe fewer genes than the given number. The value of this parameter cannot be lower than the minimal support value. This setting can strongly affect the computation time, so it is not recommended to set its value below 3.

Maximal number of GO terms in a rule premise

This option is used to set the maximal number of elements included in a rule premise. Increasing the value of this parameter will result in the generation of more specific rules (rules that describe fewer genes). One could expect that increasing the value of this parameter will also increase the number of output rules. However, this is true only up to a certain value, due to the restrictions that are applied to the generated rules (i.e., statistical significance, minimal number of genes described by the rule).

Minimal support

This option is used to set the minimal number of genes that a generated rule must describe. Only rules describing at least as many genes as defined in the Minimal support option will be presented to the user. The value of this parameter cannot be greater than the value of the minimal number of genes described by GO term. This setting can strongly affect the computation time, so it is not recommended to set its value below 3.

Maximal number of generated rules

This option is used to set the maximal number of generated rules. Only the N best rules, where N is a number defined by the user, will be presented. The quality criterion is defined by the user in the Output rules order section.
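
The combined effect of the minimal support and maximal number of generated rules options can be pictured with the sketch below. The rule records, their field names and the `select_rules` helper are hypothetical; the sketch assumes a score where higher is better, so a p-value ordering would pass the negated p-value.

```python
def select_rules(rules, min_support, max_rules, score):
    """Drop rules below the minimal support, then keep the N best by `score`."""
    kept = [r for r in rules if r["support"] >= min_support]
    kept.sort(key=score, reverse=True)  # higher score is treated as better
    return kept[:max_rules]


# Hypothetical rule records
rules = [
    {"name": "r1", "support": 5, "quality": 0.40, "p_value": 1e-4},
    {"name": "r2", "support": 2, "quality": 0.90, "p_value": 1e-6},
    {"name": "r3", "support": 4, "quality": 0.55, "p_value": 1e-3},
    {"name": "r4", "support": 3, "quality": 0.10, "p_value": 1e-5},
]

by_quality = select_rules(rules, min_support=3, max_rules=2, score=lambda r: r["quality"])
by_p_value = select_rules(rules, min_support=3, max_rules=2, score=lambda r: -r["p_value"])
print([r["name"] for r in by_quality], [r["name"] for r in by_p_value])
```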

Please take into consideration that the algorithm option settings may strongly affect the computation time. By clicking the chart below you can see how different settings of the rule generation parameters influence the computation time. The analyses were performed for a fixed p-value and for several different values of the minimal support parameter.

Filtration is the last step of the analysis and allows extracting only the best and most interesting rules from the set of all generated rules.

The filtration algorithm is executed in a loop. Beginning from the best rule in the ranking (the reference rule), all rules covering the same set of genes or its subset are candidates to be removed from the result rule set. However, before any rule is removed, its similarity to the reference rule is verified. If the similarity of a rule to the reference rule exceeds the threshold defined by the user, the rule is removed from the set of determined rules; otherwise it remains in the output rule set.
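
One possible reading of this loop is sketched below. It assumes that after the best rule has been processed, the next remaining rule in the ranking becomes the reference rule; the source does not spell this out. The similarity function is passed in as a parameter, and the Jaccard overlap used in the example is only a stand-in, not the RuleGO similarity measure.

```python
def filter_rules(ranked_rules, similarity, threshold):
    """ranked_rules: best rule first; each rule has 'genes' and 'terms' sets."""
    remaining = list(ranked_rules)
    kept = []
    while remaining:
        reference = remaining.pop(0)  # best rule still in the ranking
        kept.append(reference)
        survivors = []
        for rule in remaining:
            # Candidate for removal: covers the same genes or a subset of them
            is_candidate = rule["genes"] <= reference["genes"]
            if is_candidate and similarity(rule, reference) > threshold:
                continue  # too similar to the reference rule, removed
            survivors.append(rule)
        remaining = survivors
    return kept


def jaccard(rule_a, rule_b):
    """Stand-in similarity for the example only."""
    union = rule_a["terms"] | rule_b["terms"]
    return len(rule_a["terms"] & rule_b["terms"]) / len(union) if union else 0.0


# Hypothetical ranked rules
ranked = [
    {"name": "r1", "genes": {"a", "b", "c"}, "terms": {"T1", "T2"}},
    {"name": "r2", "genes": {"a", "b"}, "terms": {"T1", "T2", "T3"}},
    {"name": "r3", "genes": {"a", "b"}, "terms": {"T9"}},
    {"name": "r4", "genes": {"d"}, "terms": {"T1"}},
]
result = [r["name"] for r in filter_rules(ranked, jaccard, threshold=0.5)]
print(result)
```

Here `r2` is dropped because it covers a subset of the genes of `r1` and is too similar to it, while `r3` covers the same subset but with different terms and therefore stays.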

The parameters that influence the results of filtration are:

  • quality measure used to establish output rules order
  • similarity threshold used during the filtration process

The user can choose between two quality measures: the p-value and the compound quality measure. The p-value is computed based on the hypergeometric test as described in the Statistical Test section.

The compound quality measure is computed as the product of three component measures:

Q(r) = mWS(r) · length(r) · depth(r)

where:

  • mWS(r) - is denoted as the rule quality option and is a rule quality computed using the following equation:

    where:

    • acc(r) - is the rule accuracy, that is, the ratio of the number of genes that belong to the Primary Set and are described by the GO terms from the rule premise to the number of genes in both sets that are described by those GO terms
    • cov(r) - is the rule coverage, that is, the ratio of the number of genes that belong to the Primary Set and are described by the GO terms from the rule premise to the number of genes in the Primary Set
  • length(r) - is denoted as the rule length option and is computed using the following equation:

    where:

    • NoGOterms(r) - is the number of GO terms in the r rule premise
    • MaxGOterms - is the maximal number of GO terms in the longest rule that describes the genes from Primary Set
  • depth(r) - is denoted as the ontology level option and is computed using the following equation:

    where:

    • level(ai) - is the level of a GO term ai that occurs in the rule r premise
    • max_path(ai) - is the longest path leading from the root to a leaf of the GO graph that passes through the node ai

According to the user requirements any element of the compound quality measure can be removed from the measure by deselecting its corresponding checkbox. For example, if the user is interested in obtaining the rules which include many GO terms in their premises, he or she can deselect rule quality and ontology level checkboxes in Rules filtration section.
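
The effect of the checkboxes can be sketched as a product in which deselected components are simply left out. This is an assumption about how a removed element is treated; the component values are taken as given numbers because their equations are not reproduced in this text, and the function and flag names are hypothetical.

```python
def compound_quality(mws, length, depth,
                     use_quality=True, use_length=True, use_depth=True):
    """Product of the selected components; a deselected component is skipped
    (assumption: skipping is equivalent to multiplying by 1)."""
    q = 1.0
    if use_quality:
        q *= mws
    if use_length:
        q *= length
    if use_depth:
        q *= depth
    return q


# Hypothetical component values for one rule
full = compound_quality(0.8, 0.5, 0.5)
length_only = compound_quality(0.8, 0.5, 0.5, use_quality=False, use_depth=False)
print(full, length_only)
```

With only the rule length component selected, rules with many GO terms in their premises move to the top of the ranking, as described above.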

Rules similarity

Similarity of two rules (ri and rj) is computed according to the following formula:

where:

  • #GOterms(ri ,rj) - is a number of unique GO-terms occurring in the rule ri and not occurring in the rule rj
  • #GOterms(r) - is the number of GO-terms in the rule r premise

A GO-term a from the rule ri is considered unique if it does not occur directly in the rule rj and there is no path in the GO graph that includes both the term a and any term b from the premise of the rule rj.
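
The uniqueness condition can be sketched on a toy graph. Two terms lie on a common path exactly when they are equal or one is an ancestor of the other. The term labels and the `parents` mapping are hypothetical, and the sketch covers only the count of unique terms, not the similarity formula itself.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links (term excluded)."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def on_common_path(a, b, parents):
    """True if some path in the graph includes both a and b."""
    return a == b or a in ancestors(b, parents) or b in ancestors(a, parents)


def unique_terms(premise_i, premise_j, parents):
    """Terms of premise_i that are unique with respect to premise_j."""
    return {
        a for a in premise_i
        if not any(on_common_path(a, b, parents) for b in premise_j)
    }


# Hypothetical toy DAG: child -> list of parents
parents = {"x1": ["x"], "x": ["root"], "y": ["root"], "z": ["root"]}
unique = unique_terms({"x1", "y"}, {"x", "z"}, parents)
print(unique)
```

In this example `x1` is not unique because it is a descendant of `x`, which occurs in the other rule, whereas `y` shares no path with `x` or `z`.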

Output rules order

Generated rules are sorted according to one of the selected criteria: the compound quality measure which is described in the above section or by a p-value computed using hypergeometric test as described in Statistical Test section.

We encourage RuleGO users to experiment with the filtration parameter settings. In the Example of different settings of rules filtration parameters section we present several different output sets of rules obtained for the same GO annotation and rule generation parameters.

The results of the analysis (the output set of rules) are presented on the RuleGO website. For each set of generated rules we provide the number of output rules and information about the coverage (the percentage of genes from the Primary Set which are described by the generated rules).

The text file with output rules can be downloaded by clicking [download rule file] link on the result page. The list of output rules is also presented on the website.

Results of analysis

The result set of rules is presented in the form of a list, which can also be downloaded as a text file or a PDF file.
For each rule the following information is provided:

  • Number of genes supporting the rule - number of genes from the Primary Set described by the rule.
  • Number of genes recognizing the rule - number of genes from the Primary Set and the Secondary Set described by the rule.
  • Accuracy - ratio of number of genes supporting the rule to number of genes recognizing the rule. If the rule is specific for the described Primary Set of genes, then this value is close to one.
  • Coverage - ratio of number of genes supporting the rule to number of genes from the Primary Set. This value shows how general the rule is. The closer to one, the more genes from Primary Set are described.
  • P-value - statistical significance of the rule computed by hypergeometric test.
  • FDR corrected p-value: corrected p-value based on Benjamini and Hochberg method for false discovery rate computation.
  • Quality: Rule quality measure based on Compound Quality Measure.
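
Accuracy and coverage follow directly from the counts defined in the list above. In the sketch below the function names are hypothetical, and the Primary Set size of 11 is an assumed value chosen for illustration.

```python
def accuracy(n_supporting, n_recognizing):
    """Genes supporting the rule divided by genes recognizing the rule."""
    return n_supporting / n_recognizing


def coverage(n_supporting, n_primary):
    """Genes supporting the rule divided by the size of the Primary Set."""
    return n_supporting / n_primary


# 4 supporting genes, 4 recognizing genes, assumed Primary Set of 11 genes
acc = accuracy(4, 4)
cov = coverage(4, 11)
print(acc, cov)
```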

To see the quality indices for a particular rule, click the Show details... link located in the header of the rule. You can also expand this information for the whole set of rules by clicking the Expand all link.

Below we provide a link to an example result set of rules.
Exemplary result
Exemplary rule
#rule No: 1
GO:0000280 /// nuclear division /// biological_process /// (sup=5)(rec=8)(level=5)
GO:0000278 /// mitotic cell cycle /// biological_process /// (sup=8)(rec=10)(level=4)
GO:0022402 /// cell cycle process /// biological_process /// (sup=10)(rec=20)(level=4)
GO:0006464 /// cellular protein modification process /// biological_process /// (sup=4)(rec=17)(level=6)

Number of genes supporting the rule: 4
Number of genes recognizing the rule: 4
Accuracy: 1.0
Coverage: 0.36363636363636365
FDR corrected p-value: 1.6753240009480362E-5
Quality: 0.2944214876033058

Symbols of genes supporting rule No. 1:
S000004200, S000002314, S000002525, S000001505
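
For users who want to post-process a downloaded rule file, the term lines shown above could be parsed along these lines. The regular expression is inferred from this single example rule, so it is an assumption about the file format rather than a specification; the function name is hypothetical.

```python
import re

# Pattern inferred from the example rule above (assumption, not a format spec)
TERM_LINE = re.compile(
    r"^(GO:\d{7}) /// (.+?) /// (\w+) /// "
    r"\(sup=(\d+)\)\(rec=(\d+)\)\(level=(\d+)\)$"
)


def parse_term_line(line):
    """Parse one GO term line of a rule into a dictionary."""
    match = TERM_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized term line: {line!r}")
    go_id, name, aspect, sup, rec, level = match.groups()
    return {
        "id": go_id,
        "name": name,
        "aspect": aspect,
        "sup": int(sup),
        "rec": int(rec),
        "level": int(level),
    }


term = parse_term_line(
    "GO:0000280 /// nuclear division /// biological_process /// (sup=5)(rec=8)(level=5)"
)
print(term)
```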

Below we present a table including five different sets of rules. Each set of rules was generated for the same input lists of signature and reference genes, using the same GO annotation and rule generation parameters. The only difference was in the settings of the filtration parameters.

Filtration  Output sort order                                  Number of output rules  Link to file
NO          p-value                                            28539                   download rules file
YES         compound quality measure (quality, length, depth)  23                      download rules file
YES         compound quality measure (quality only)            25                      download rules file
YES         compound quality measure (length only)             26                      download rules file
YES         compound quality measure (depth only)              19                      download rules file