Query Tab

Overview

The form is divided into four main sections: Basic, Structure, Sequence and Restrict to Antibody Class.

Each section of the form has an associated checkbox.

Only checked sections are used when the Submit button is clicked. Unchecked sections are collapsed and hidden from view. Settings may persist in unchecked sections, but will be ignored unless the associated checkbox is checked.

Each section of the form has subsections which work the same way.

In general, a chain or chain pair is included in the results only if it satisfies the specified criteria in all checked sections of the form.
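As a rough illustration of this rule, the sketch below treats each form section as a checkbox plus a test. It is not abYsis code: the `Section` type, its field names, the example tests and the example chain are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Section:
    name: str
    checked: bool                  # state of the section's checkbox
    test: Callable[[dict], bool]   # hypothetical check of that section's criteria

def included(chain: dict, sections: list) -> bool:
    """True if the chain passes every checked section; unchecked ones are ignored."""
    return all(s.test(chain) for s in sections if s.checked)

sections = [
    Section("Basic", True, lambda c: c["organism"] == "Mus musculus"),
    # Settings persist here but are ignored while the box is unchecked.
    Section("Sequence", False, lambda c: "WGQG" in c["sequence"]),
]
chain = {"organism": "Mus musculus", "sequence": "EVQLVESGGG"}
print(included(chain, sections))
```

Checking the Sequence box in this example would exclude the chain, because its sequence does not contain the motif.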

Pairing

Chains are paired in abYsis using a combination of sequence analysis results and textual information obtained from the data source. The pairing information is pre-calculated and stored in the database.

A chain pair will be included in the results only if both chains satisfy their applicable light or heavy chain specific criteria (as specified in the Structure, Sequence or Restrict to Antibody Class sections of the form) and if at least one of the chains satisfies the criteria in the Basic section of the form. The Basic criteria are handled differently because information that should be identical for both chains in a pair, such as antigen, author and organism, is sometimes missing or misspelled.
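The pair rule can be written as a small boolean expression. This is only a sketch with hypothetical names; each argument is assumed to be a result already computed for one chain of the pair.

```python
def pair_included(light_specific_ok: bool, heavy_specific_ok: bool,
                  light_basic_ok: bool, heavy_basic_ok: bool) -> bool:
    """Return True if a chain pair passes the rule described above."""
    # Both chains must satisfy their own chain-specific criteria
    # (Structure, Sequence, Restrict to Antibody Class).
    both_specific = light_specific_ok and heavy_specific_ok
    # The Basic criteria only need to hold for one chain of the pair.
    any_basic = light_basic_ok or heavy_basic_ok
    return both_specific and any_basic
```

For example, a pair whose heavy chain record lacks the antigen annotation can still be returned, provided the light chain record carries it.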

Text Searches

Text searches in abYsis are case insensitive. Search and target terms are converted to a common case before all comparisons.

Supported comparisons include:

Operator    Description
is          Search for the exact text entered, but ignoring upper/lower case.
is like     Allow wild cards '%' (any string) and '_' (any character).
            A '%' is placed automatically at the beginning and end of the entered text.
            Case insensitive.
~ (regex)   PostgreSQL POSIX regular expression matching.
            Case insensitive.
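The three comparisons can be approximated outside the database as follows. This is a sketch under stated assumptions, not the abYsis implementation: `like_to_regex` and `text_matches` are hypothetical helpers, Python's `re` module stands in for PostgreSQL POSIX regular expressions (the two syntaxes differ in places), and backslash escapes inside 'is like' patterns are not handled.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate the wild cards '%' and '_' into a Python regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any string
        elif ch == "_":
            parts.append(".")            # any single character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)

def text_matches(operator: str, term: str, target: str) -> bool:
    """Case-insensitive comparison of a search term with a target string."""
    if operator == "is":
        return term.lower() == target.lower()
    if operator == "is like":
        # A '%' is added automatically at both ends of the entered text.
        regex = like_to_regex("%" + term + "%")
        return re.fullmatch(regex, target, re.IGNORECASE | re.DOTALL) is not None
    if operator == "~":
        # Unanchored match, as with the PostgreSQL regex operators.
        return re.search(term, target, re.IGNORECASE) is not None
    raise ValueError(f"unknown operator: {operator!r}")
```

With this sketch, `text_matches("is like", "gamma", "mAb3F2 immunoglobulin gamma heavy chain")` is true, while `text_matches("is", "mAb", "mAb3F2")` is false because 'is' requires the whole text to match.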

Basic

Data Source

Use the dropdown to restrict the search to a particular data source such as PDB (protein sequences with structures, but no DNA), abYsis-EMBL-IG (a collection of Protein + DNA sequences from EMBL, identified via similarity to curated IG sequences) or NCBI Germline (germline DNA sequences from NCBI).

Source ID or Accession

The Source ID is the primary entry-level identifier used by the Data Source. Examples include:

There are often multiple protein and/or nucleotide sequences in abYsis corresponding to a given Source ID.

The Accession is used internally by abYsis. It uniquely identifies a sequence in abYsis, for a given Source ID. Examples:

Where the Data Source does not provide a suitable sequence-level identifier and there are multiple sequences for a given Source ID, the abYsis accession is the Source ID with an appended counter e.g. A123456 (2), A123456 (3).

Name

The Name field is derived from textual annotations provided by the data source.

It may correspond to a gene name, protein product, sequence title, mnemonic or other text description. Note that the text cannot always be parsed into a simple, distinct name for the antibody, since other text may also be present. For example, the name field for abYsis-EMBL-IG entry ABD73927.1 is 'mAb3F2 immunoglobulin gamma heavy chain' rather than simply 'mAb3F2'.

Only a single name or search term should be entered.

Antigen

The Antigen field is currently populated only for Kabat sequences since this information cannot be parsed automatically from other data sources. Hence, if text is entered the search will be restricted to Kabat sequences.

Only a single name or search term should be entered.

Clone

The Clone field is populated only for data sources using EMBL format files. Hence, if text is entered the search will be restricted to those data sources.

Only a single name or search term should be entered.

Reference

The Reference field allows text to be compared with titles and publication details of the reference and patent data associated with each sequence. Patent data is populated only for data sources using EMBL format files.

Only a single name or search term should be entered.

Author

The Author field allows text to be compared with the surnames of the authors of the reference data associated with each sequence.

Only a single name or search term should be entered.

Publication Year

Select a Publication Year and use the adjacent dropdown to select whether you are interested in publications before, after or during that year.

The search will be restricted to those sequences which have at least one publication within the specified range.
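The "at least one publication" rule can be sketched as below. The function and dictionary names are hypothetical, and treating 'before' and 'after' as strict bounds that exclude the selected year is an assumption.

```python
import operator

# Map each dropdown choice to a comparison.
# Assumption: 'before' and 'after' exclude the selected year itself.
RELATIONS = {"before": operator.lt, "during": operator.eq, "after": operator.gt}

def has_publication_in_range(publication_years, selected_year, relation):
    """True if at least one publication year falls in the requested range."""
    compare = RELATIONS[relation]
    return any(compare(year, selected_year) for year in publication_years)
```

A sequence with publications in 1998 and 2005 would therefore match both 'before 2000' and 'after 2000', but not 'during 2000'.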

Organism

Use the Organism dropdown to select an organism name. Organism names have been parsed from the data source, with some automated error checking and/or mapping via aliases. The organism name stored in abYsis is almost always the species or sub-species, sometimes the genus and very occasionally a common name.

The search will be restricted to organisms with that name and organisms that start with that name e.g. if you search for Rattus the search will allow Rattus rattus and Rattus norvegicus.
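The matching rule amounts to a prefix test, sketched below. `organism_matches` is a hypothetical helper, and the sketch uses a plain, case-sensitive prefix comparison; whether abYsis additionally requires a word boundary after the selected name is not stated here.

```python
def organism_matches(selected: str, stored: str) -> bool:
    """True if the stored organism name equals or starts with the selection."""
    return stored.startswith(selected)

stored_names = ["Rattus rattus", "Rattus norvegicus", "Mus musculus"]
hits = [name for name in stored_names if organism_matches("Rattus", name)]
print(hits)  # ['Rattus rattus', 'Rattus norvegicus']
```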

Note that species information is taken from the source data files.

Exclude sequences with warnings

Check this box to exclude sequences with warnings.

A small fraction (<1% of public data) of abYsis sequences carry warnings. The bulk of these are germline DNA sequences flagged as pseudogenes or non-functional.

Exclude unclassified sequences

Check this box to exclude unclassified sequences.

Sequences are classified in abYsis as heavy, light, kappa or lambda using a combination of textual annotations provided by the data source and computed annotations made by abYsis. In some cases textual annotation is incomplete or ambiguous and in some cases abYsis may fail to determine a chain type. Where there is an inconsistency, the computed annotation is preferred and the sequence is tagged with a warning.

Exclude unpaired sequences

Check this box to exclude unpaired sequences.

Light and heavy chain sequences are paired in abYsis using a combination of textual annotations provided by the data source and computed annotations made by abYsis. A cautious approach is taken to pairing to avoid incorrect pairs at the expense of missing some correct pairs.

Exclude un-numbered sequences

Check this box to exclude sequences that are un-numbered because the automated numbering has failed.

Not all sequences can be numbered. For example, some sequences with large and/or unusual deletions or insertions cannot be numbered. All numbered sequences are classified, but there are some classified sequences that are not numbered. Protein sequences shorter than 70 residues are not processed through the numbering pipeline.


Structure

Overview

This part of the form focuses on structural searches.

Some options require structural information directly and thus have the effect of restricting the search to sequences for which atom coordinates have been loaded in the database i.e. PDB sequences. Other options exploit sequence/structure relationships and may be used for all sequences, whether or not there is structural information available.

Search terms may be entered separately for heavy and light chains. If you specify terms for both heavy and light chains in this section or elsewhere in the form, the search will be restricted to paired chains.

All the options in this section refer to numbered positions. If search terms are entered, the search will implicitly be restricted to chains that have been numbered.

Sequences with a predicted canonical class

If this section of the form is checked, the search will be restricted to chains with loops classified as belonging to a specified canonical class based on the presence of key residues.

Each canonical class (a structural concept) is encoded via a set of sequence rules. The rules for a given class allow particular residues at certain required positions i.e. positions which must be present, for a match. Via these rules, canonical classes can be predicted for all numbered chains, irrespective of whether structural information is available.
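Rule-based prediction of this kind can be sketched as a lookup over numbered positions. Everything named here is hypothetical: the class names, positions and allowed residues are invented for illustration and are not real canonical class definitions, and returning the first matching class is an assumption.

```python
from typing import Optional

def matches_rules(numbered_chain: dict, rules: dict) -> bool:
    """True if every required position is present with an allowed residue."""
    for position, allowed in rules.items():
        residue = numbered_chain.get(position)
        if residue is None or residue not in allowed:
            return False
    return True

def predict_class(numbered_chain: dict, classes: dict) -> Optional[str]:
    """Return the first class whose rules all match, or None."""
    for name, rules in classes.items():
        if matches_rules(numbered_chain, rules):
            return name
    return None

# Invented rule sets: position label -> residues allowed at that position.
classes = {
    "class A": {"L2": {"I"}, "L25": {"A"}},
    "class B": {"L2": {"V"}, "L25": {"A", "S"}},
}
chain = {"L2": "V", "L25": "S", "L33": "L"}
print(predict_class(chain, classes))  # class B
```

Because only the numbered sequence is consulted, the same check works for chains with no structural data.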

See the help on the About/Definitions page for a description of terms ('Exact' / 'Similar') and the different classification 'Methods'.

Residues within x Ångströms of your chosen position in known structures

Restricts the search to structures that have (or do not have) particular amino acid types within a particular distance of a specified residue position.

Use add row to require additional positions or click delete row to remove positions that are not required.

If the Chothia Position dropdown is not selected in a given row, no constraint is applied and the row is effectively removed from consideration. Similarly, if the 'at least one of' option is selected in the Constraint dropdown, but nothing has been entered in the text box, no further constraint is applied.

Residues within x Ångströms of your chosen position in known structures and predicted in sequences

Restricts the search to structures that have (or do not have) particular amino acid types within a particular distance of a specified residue position. In addition to known structures, this allows you to consider sequences that are numbered but do not have structural data.

It uses a set of distance distributions, pre-calculated from known antibody structures, to predict which residues lie in the vicinity of a selected position. The prediction relies on the antibodies being numbered correctly.

The original distributions were calculated using many hundreds of structures of numbered antibody chains so the prediction algorithm uses an average distance and a standard deviation.

Results are returned when m < d + nσ, where m is the mean C-alpha to C-alpha distance between the positions, d is the specified distance, σ is the standard deviation of that distance and n is the number of standard deviations allowed.
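The threshold test is simple arithmetic, sketched below. The function and parameter names are hypothetical and the numbers in the example are illustrative only, not values taken from abYsis.

```python
def predicted_within(mean_distance: float, std_dev: float,
                     cutoff: float, n: float) -> bool:
    """Apply m < d + n*sigma for one pair of numbered positions.

    mean_distance -- m, mean C-alpha to C-alpha distance in Angstroms
    std_dev       -- sigma, standard deviation of that distance
    cutoff        -- d, the distance entered on the form
    n             -- number of standard deviations allowed
    """
    return mean_distance < cutoff + n * std_dev

# Illustrative values: m = 6.0, sigma = 1.0, d = 5.0, n = 2 gives 6.0 < 7.0.
print(predicted_within(6.0, 1.0, 5.0, 2.0))  # True
```

A larger σ widens the accepted range, so position pairs whose distance varies a lot between structures are more readily reported.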

Use add row to require additional positions or click delete row to remove positions that are not required.

If the Chothia Position dropdown is not selected in a given row, no constraint is applied and the row is effectively removed from consideration. Similarly, if the 'at least one of' option is selected in the Constraint dropdown, but nothing has been entered in the text box, no further constraint is applied.


Sequence

Overview

Use this part of the form to restrict the search to chains which have a protein sequence that contains specified fragments, motifs or residues.

Terms may be specified separately for heavy and light chains. If you specify terms for both heavy and light chains in this section or elsewhere in the form, the search will be restricted to paired chains.

Each option (with the exception of the search for chosen motifs within complete sequences) refers to numbered positions or to delineated regions within chains, which depend on the numbering. If search terms are entered for any of these, the search will be automatically restricted to chains which have been numbered.

Search for chosen motifs within complete sequences

Restrict the search to sequences which contain particular fragments or motifs.

The search is case insensitive. Regex searches use PostgreSQL POSIX regular expression matching.

Search for chosen motifs within specific regions

Restrict the search to chains with regions which contain particular fragments or motifs.

The search is case insensitive. Regex searches use PostgreSQL POSIX regular expression matching.

A chain will be included in the results only if all the entered criteria are satisfied.

Specify minimum and maximum lengths for regions

Restrict the search to chains with regions which have lengths in a specified range.

A chain will be returned in the results only if all the entered criteria are satisfied.
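A length filter of this kind might look like the sketch below. The names are hypothetical, the region labels are placeholders, and two behaviours are assumptions: the bounds are inclusive, and a chain that lacks a constrained region fails.

```python
def lengths_in_range(region_lengths: dict, limits: dict) -> bool:
    """Check every constrained region against its (minimum, maximum) limits.

    limits maps a region label to a (minimum, maximum) tuple;
    None in either slot means that bound is not set.
    """
    for region, (minimum, maximum) in limits.items():
        length = region_lengths.get(region)
        if length is None:
            return False  # assumption: a missing region fails the criterion
        if minimum is not None and length < minimum:
            return False
        if maximum is not None and length > maximum:
            return False
    return True

# Placeholder region labels and lengths for one chain.
chain_regions = {"CDR-H1": 7, "CDR-H3": 12}
print(lengths_in_range(chain_regions, {"CDR-H3": (10, 15)}))  # True
```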

Constrain amino acids at required positions

Restrict the search to chains which have particular residues at positions of interest e.g. Chothia key residues or residues known to be important in humanization.

If a Required Position is specified without any Amino Acids, all chains that have that numbered position are identified, irrespective of the amino acid present. This could be used to find sequences with an unusual insertion of interest.

Use add row to require additional positions or click delete row to remove positions that are not required.

If the Required Position dropdown is not selected in a given row, no constraint is applied and the row is effectively removed from consideration.

A chain will be included in the results only if all required positions are present and all the associated amino acid constraints are satisfied.
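The row semantics described above can be sketched as follows. The function name and the row format (a position label paired with a string of allowed one-letter codes) are assumptions made for illustration, as are the position labels in the example.

```python
def satisfies_rows(numbered_chain: dict, rows: list) -> bool:
    """Apply every form row to a chain given as {position label: residue}."""
    for position, amino_acids in rows:
        if position is None:
            continue          # no position selected: the row is ignored
        residue = numbered_chain.get(position)
        if residue is None:
            return False      # a required position is absent from the chain
        if amino_acids and residue not in amino_acids:
            return False      # position present, but residue not allowed
    return True

rows = [
    ("H94", "RK"),    # position must be present and hold R or K
    (None, "A"),      # ignored: no position selected
    ("H100A", ""),    # position must be present, any residue
]
print(satisfies_rows({"H94": "R", "H100A": "G"}, rows))  # True
```

The third row shows the insertion use case: leaving the amino acids empty selects chains purely on whether the position exists.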


Restrict to Antibody Class

Use the dropdown to specify a chain class. The classification is hierarchical e.g. a search for Heavy Gamma chains will also retrieve chains classified as Heavy Gamma 2 A.
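One way to picture the hierarchical match is as a prefix comparison on the words of the class label. This is a sketch only: `class_matches` is a hypothetical helper, and treating class labels as space-separated levels is an assumption about how the hierarchy is stored. Labels other than those quoted above are illustrative.

```python
def class_matches(selected: str, chain_class: str) -> bool:
    """True if chain_class is the selected class or one of its sub-classes."""
    wanted = selected.split()
    actual = chain_class.split()
    # The chain's label must begin with every level of the selected class.
    return actual[:len(wanted)] == wanted

print(class_matches("Heavy Gamma", "Heavy Gamma 2 A"))  # True
print(class_matches("Heavy Gamma 2 A", "Heavy Gamma"))  # False
```

Selecting a broad class therefore returns all of its sub-classes, while selecting a narrow class does not return chains classified only at the broader level.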

Chains are classified in abYsis using a combination of sequence analysis results and textual information provided by the data source.

In some cases the textual information relating to chain type or class is missing or ambiguous, while in other cases sequence analysis fails to give a clear determination. Where there is an inconsistency, the sequence analysis result is preferred and a warning is flagged. Some chains remain unclassified.

If you specify terms for both heavy and light chains in this section or elsewhere in the form, the search will be restricted to paired chains.