Welcome all readers, viewers, researchers and aspirants to this site for upgrading knowledge and aptitude in pharmaceutical industry oriented clinical research. "If you think Research is Expensive, Try Ineffectiveness in Illness"
Saturday, 29 June 2013
Thursday, 20 June 2013
Fundamental Of Trial Design : Randomized Controlled Trials
INTRODUCTION
Randomized clinical trials are scientific investigations that examine
and evaluate the safety and efficacy of new drugs or therapeutic
procedures using human subjects. The results that these studies generate
are considered to be the most valued data in the era of evidence-based
medicine. Understanding the principles behind clinical trials enables an
appreciation of the validity and reliability of their results.
What is a randomized clinical trial?
A clinical trial evaluates the effect of a new drug (or device or
procedure) on human volunteers. These trials can be used to evaluate the
safety of a new drug in healthy human volunteers, or to assess
treatment benefits in patients with a specific disease. Clinical trials
can compare a new drug against existing drugs or against dummy
medications (placebo) or they may not have a comparison arm. A large
proportion of clinical trials are sponsored by pharmaceutical or
biotechnology companies who are developing the new drug, but some
studies using older drugs in new disease areas are funded by health
related government agencies, or through charitable grants.
In a randomized clinical trial, patients are allocated to one or other
treatment group at random. Where the trial is also blinded, patients and
trial personnel are deliberately kept unaware of which patient is on the
new drug. This minimizes bias in the later evaluation so that the initial
blind random allocation of patients to one or other treatment group is
preserved throughout the trial. Clinical trials must be designed in an
ethical manner so that patients are not denied the benefit of usual
treatments. Patients must give their voluntary informed consent,
confirming that they understand the purpose of the trial. Several key
guidelines regarding the ethics, conduct, and reporting of clinical trials
have been constructed to ensure that a patient’s rights and safety are not
compromised by participating in clinical trials.
Are there different types of clinical trials?
Clinical trials vary depending on who is conducting the trial.
Pharmaceutical companies typically conduct trials involving new drugs or
established drugs in disease areas where their drug may gain a new
license. Device manufacturers use trials to prove the safety and
efficacy of their new device. Clinical trials conducted by clinical
investigators unrelated to pharmaceutical companies might have other
aims. They might use established or older drugs in new disease areas,
often without commercial support, given that older drugs are
unlikely to generate much profit. Clinical investigators might also:
- look at the best way to give or withdraw drugs
- investigate the best duration of treatment to maximize outcome
- assess the benefits of prevention with vaccination or screening programs
Thus, different types of trials are needed to cover these needs; these can be classified under the following headings:
Phases:
The pharmaceutical industry has adopted a specific trial classification
based on the four clinical phases of development of a particular drug (Phases I–IV). In Phase I,
manufacturers usually test the effects of a new drug in healthy
volunteers or patients unresponsive to usual therapies. They look at how
the drug is handled in the human body
(pharmacokinetics/pharmacodynamics), particularly with respect to the
immediate short-term safety of higher doses. Clinical trials in Phase II
examine dose–response curves in patients and what benefits might be
seen in a small group of patients with a particular disease. In Phase III,
a new drug is tested in a controlled fashion in a large patient
population against a placebo or standard therapy. This is a key phase,
where a drug will either make or break its reputation with respect to
safety and efficacy before marketing begins. A positive study in Phase III
is often known as a landmark study for a drug, through which it might
gain a license to be prescribed for a specific disease. A study in Phase
IV is often called a post-marketing study as the drug has already been
granted regulatory approval/license. These studies are crucial for
gathering additional safety information from a larger group of patients
in order to understand the long-term safety of the drug and appreciate
drug interactions.
Trial design:
Trials can be further classified by design. This classification is more
descriptive in terms of how patients are randomized to treatment. The
most common design is the parallel-group trial. Patients are randomized
to the new treatment or to the standard treatment and followed up to
determine the effect of each treatment in parallel groups. Other trial
designs include, amongst others, crossover trials, factorial trials, and
cluster randomized trials.
Crossover trials randomize patients to different sequences of
treatments, but all patients eventually get all treatments in varying
order, i.e., the patient is his/her own control. Factorial trials
assign patients to more than one treatment-comparison group. These are
randomized in one trial at the same time, i.e., while drug A is being
tested against placebo, patients are re-randomized to drug B or placebo,
making four possible treatment combinations in total. Cluster randomized trials
are performed when larger groups (e.g., patients of a single
practitioner or hospital) are randomized instead of individual patients.
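As a rough illustration of how allocation differs between these designs, the Python sketch below shows permuted-block randomization for a two-arm parallel-group trial, independent randomization on two factors for a 2x2 factorial trial, and random sequence assignment for a two-period crossover trial. The function names, arm labels, block size, and seeds are assumptions made for this example only; they do not come from any guideline, and a real trial would use a validated randomization system.

```python
import random

def permuted_block_allocation(n_patients, arms=("new", "standard"), block_size=4, seed=0):
    """Parallel-group design: allocate in shuffled blocks so arm sizes stay balanced."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    allocation = []
    while len(allocation) < n_patients:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n_patients]

def factorial_allocation(n_patients, seed=0):
    """2x2 factorial design: randomize each patient on drug A and, independently, on drug B."""
    rng = random.Random(seed)
    return [(rng.choice(("A", "placebo A")), rng.choice(("B", "placebo B")))
            for _ in range(n_patients)]

def crossover_sequences(n_patients, seed=0):
    """Two-period crossover design: randomize the order in which each patient receives A and B."""
    rng = random.Random(seed)
    return [rng.choice(("AB", "BA")) for _ in range(n_patients)]

parallel = permuted_block_allocation(12)
factorial = factorial_allocation(200)
crossover = crossover_sequences(10)
```

Blocking is only one of several ways to keep arms balanced; it is used here because it makes the balance easy to see in a short example.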
Number of centers:
Clinical trials can also be classified as single-center or multicenter studies according to the number of sites involved. While single-center studies are mainly used for Phase I and II studies, multicenter studies can be carried out at any stage of clinical development. Multicenter studies are necessary for two
major reasons:
- to evaluate a new medication or procedure more efficiently in terms of accruing sufficient subjects over a shorter period of time
- to provide a better basis for the subsequent generalization of the trial’s findings, i.e., the effects of the treatment are evaluated in many types of centers.
CLINICAL TRIAL PROTOCOL DEVELOPMENT AND KEY COMPONENTS OF TRIAL PROTOCOL
CLINICAL TRIAL PROTOCOL DEVELOPMENT
Once a clinical question has been postulated, the first step in the
conception of a clinical trial to answer that question is to develop a
trial protocol. A well-designed protocol reflects the scientific and
methodological integrity of a trial. Protocol development has evolved in
a complex way over the last 20 years to reflect the care and attention
given to undertaking clinical experiments with human volunteers,
reflecting the high standards of safety and ethics involved as well as
the complex statistical issues.
Questions addressed by a protocol:
- What is the clinical question being asked by the trial?
- How should it be answered, in compliance with the standard ethical and regulatory requirements?
- What analyses should be performed in order to produce meaningful results?
- How will the results be presented?
Qualities of a good protocol:
- Clear, comprehensive, easy to navigate, and unambiguous.
- Designed in accordance with the current principles of Good Clinical Practice and other regulatory requirements.
- Gives a sound scientific background of the trial.
- Clearly identifies the benefits and risks of being recruited into the trial.
- Plainly describes trial methodology and practicalities.
- Ensures that the rights, safety, and well-being of trial participants are not unduly compromised.
- Gives enough relevant information to make the trial and its results reproducible.
- Indicates all features that assure the quality of every aspect of the trial.
CLINICAL TRIAL PROTOCOL
The contents of a trial protocol should generally include the following topics. However, site specific information may be provided on separate protocol page(s), or addressed in a separate agreement, and some of the information listed below may be contained in other protocol referenced documents, such as an Investigator’s Brochure.
General Information:
- Protocol title, protocol identifying number, and date. Any amendment(s) should also bear the amendment number(s) and date(s).
- Name and address of the sponsor and monitor (if other than the sponsor).
- Name and title of the person(s) authorized to sign the protocol and the protocol amendment(s) for the sponsor.
- Name, title, address, and telephone number(s) of the sponsor's medical expert (or dentist when appropriate) for the trial.
- Name and title of the investigator(s) who is (are) responsible for conducting the trial, and the address and telephone number(s) of the trial site(s).
- Name, title, address, and telephone number(s) of the qualified physician (or dentist, if applicable), who is responsible for all trial-site related medical (or dental) decisions (if other than investigator).
- Name(s) and address(es) of the clinical laboratory(ies) and other medical and/or technical department(s) and/or institutions involved in the trial.
Background Information:
- Name and description of the investigational product(s).
- A summary of findings from nonclinical studies that potentially have clinical significance and from clinical trials that are relevant to the trial.
- Summary of the known and potential risks and benefits, if any, to human subjects.
- Description of and justification for the route of administration, dosage, dosage regimen, and treatment period(s).
- A statement that the trial will be conducted in compliance with the protocol, GCP and the applicable regulatory requirement(s).
- Description of the population to be studied.
- References to literature and data that are relevant to the trial, and that provide background for the trial.
Trial Objectives and Purpose:
- A detailed description of the objectives and the purpose of the trial.
Trial Design:
The scientific integrity of the trial and the credibility of the data from the trial depend
substantially on the trial design. A description of the trial design, should include:
- A specific statement of the primary endpoints and the secondary endpoints, if any, to be measured during the trial.
- A description of the type/design of trial to be conducted (e.g. double-blind, placebo-controlled, parallel design) and a schematic diagram of trial design, procedures and stages.
- A description of the measures taken to minimize/avoid bias, including:
(a) Randomization.
(b) Blinding.
- A description of the trial treatment(s) and the dosage and dosage regimen of the investigational product(s). Also include a description of the dosage form, packaging, and labelling of the investigational product(s).
- The expected duration of subject participation, and a description of the sequence and duration of all trial periods, including follow-up, if any.
- A description of the "stopping rules" or "discontinuation criteria" for individual subjects, parts of trial and entire trial.
- Accountability procedures for the investigational product(s), including the placebo(s) and comparator(s), if any.
- Maintenance of trial treatment randomization codes and procedures for breaking codes.
- The identification of any data to be recorded directly on the CRFs (i.e. no prior written or electronic record of data), and to be considered to be source data.
Selection and Withdrawal of Subjects:
- Subject inclusion criteria.
- Subject exclusion criteria.
- Subject withdrawal criteria (i.e. terminating investigational product treatment/trial treatment) and procedures specifying:
- When and how to withdraw subjects from the trial/ investigational product treatment.
- The type and timing of the data to be collected for withdrawn subjects.
- Whether and how subjects are to be replaced.
- The follow-up for subjects withdrawn from investigational product treatment/trial treatment.
Treatment of Subjects:
- The treatment(s) to be administered, including the name(s) of all the product(s), the dose(s), the dosing schedule(s), the route/mode(s) of administration, and the treatment period(s), including the follow-up period(s) for subjects for each investigational product treatment/trial treatment group/arm of the trial.
- Medication(s)/treatment(s) permitted (including rescue medication) and not permitted before and/or during the trial.
- Procedures for monitoring subject compliance.
Assessment of Efficacy:
- Specification of the efficacy parameters.
- Methods and timing for assessing, recording, and analysing efficacy parameters.
Assessment of Safety:
- Specification of safety parameters.
- The methods and timing for assessing, recording, and analyzing safety parameters.
- Procedures for eliciting reports of and for recording and reporting adverse events and intercurrent illnesses.
- The type and duration of the follow-up of subjects after adverse events.
Statistics:
- A description of the statistical methods to be employed, including timing of any planned interim analysis(ses).
- The number of subjects planned to be enrolled. In multicentre trials, the numbers of enrolled subjects projected for each trial site should be specified. Reason for choice of sample size, including reflections on (or calculations of) the power of the trial and clinical justification.
- The level of significance to be used.
- Criteria for the termination of the trial.
- Procedure for accounting for missing, unused, and spurious data.
- Procedures for reporting any deviation(s) from the original statistical plan (any deviation(s) from the original statistical plan should be described and justified in protocol and/or in the final report, as appropriate).
- The selection of subjects to be included in the analyses (e.g. all randomized subjects, all dosed subjects, all eligible subjects, evaluable subjects).
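The sample size and significance level items above are linked by a power calculation. As a hedged sketch, the Python below applies the standard normal-approximation formula for comparing two proportions with equal group sizes. The function name and the example event rates (30% on control, 20% on treatment) are assumptions for illustration, and a real protocol would also justify the choice of test, expected dropout, and any interim analyses.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Patients needed per group for a two-sided test of two proportions
    (normal approximation, equal allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile corresponding to the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical example: event rate 30% on control vs 20% on treatment.
n_per_group = sample_size_two_proportions(0.30, 0.20)
n_high_power = sample_size_two_proportions(0.30, 0.20, power=0.90)
```

With these assumed rates the calculation gives roughly 290 patients per group at 80% power. Asking for 90% power, or trying to detect a smaller difference, increases the number.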
Direct Access to Source Data/Documents:
- The sponsor should ensure that it is specified in the protocol or other written agreement that the investigator(s)/institution(s) will permit trial-related monitoring, audits, IRB/IEC review, and regulatory inspection(s), providing direct access to source data/documents.
Quality Control and Quality Assurance:
Ethics:
- Description of ethical considerations relating to the trial.
Data Handling and Record Keeping:
Financing and Insurance:
- Financing and insurance if not addressed in a separate agreement.
Publication Policy:
- Publication policy, if not addressed in a separate agreement.
Supplements:
(NOTE: Since the protocol and the clinical trial/study report are closely related, further relevant information can be found in the ICH Guideline for Structure and Content of Clinical Study Reports.)
Key components of a trial protocol
The trial protocol is a comprehensive document and the core structure of the protocol should be adapted according to the type of trial. ICH–GCP can be used as a reference document when developing a protocol for pharmaceutical clinical trials (Phase I to Phase IV) involving a pharmaceutical substance (the investigational medicinal product [IMP]). Most institutions and pharmaceutical companies use a standard set of rules to define the main protocol outline, structure, format, and naming/numbering methods for their trials. In this section, we briefly describe the main components of a typical protocol.
Protocol information page:
The front page gives the:
- trial title
- trial identification number
- protocol version number
- date prepared
The descriptive title of the protocol should be kept as short as possible, but at the same time it should reflect the design, type of population, and aim of the trial. ICH–GCP suggests that the title of a pharmaceutical trial should additionally include the medicinal product(s), the nature of the treatment (eg, treatment, prophylaxis, diagnosis, radiosensitizer), any comparator(s) and/or placebo(s), indication, and setting (outpatient or inpatient). The key investigational site, investigator, and sponsor should also be detailed on the title page.
Trial summary or synopsis:
A synopsis should provide the key aspects of the protocol in no more than two pages, and can be prepared in a table format. The main components of the protocol summary include:
- full title
- principal investigator
- planned study dates
- objectives
- study design
- study population
- treatments
- procedures
- sample size
- outcome measures
- statistical methods
Friday, 14 June 2013
Docking (molecular)
In the field of molecular modeling, docking is a method which predicts the preferred orientation of one molecule to a second when bound to each other to form a stable complex.[1] Knowledge of the preferred orientation may in turn be used to predict the strength of association or binding affinity between two molecules using, for example, scoring functions.
The associations between biologically relevant molecules such as proteins, nucleic acids, carbohydrates, and lipids play a central role in signal transduction. Furthermore, the relative orientation of the two interacting partners may affect the type of signal produced (e.g., agonism vs antagonism). Therefore docking is useful for predicting both the strength and type of signal produced.
Docking is frequently used to predict the binding orientation of small molecule drug candidates to their protein targets in order to in turn predict the affinity and activity of the small molecule. Hence docking plays an important role in the rational design of drugs.[2] Given the biological and pharmaceutical significance of molecular docking, considerable efforts have been directed towards improving the methods used to predict docking.
Definition of problem
Molecular docking can be thought of as a problem of “lock-and-key”, where one is interested in finding the correct relative orientation of the “key” which will open up the “lock” (where on the surface of the lock is the key hole, which direction to turn the key after it is inserted, etc.). Here, the protein can be thought of as the “lock” and the ligand can be thought of as a “key”. Molecular docking may be defined as an optimization problem, which would describe the “best-fit” orientation of a ligand that binds to a particular protein of interest. However, since both the ligand and the protein are flexible, a “hand-in-glove” analogy is more appropriate than “lock-and-key”.[3] During the course of the process, the ligand and the protein adjust their conformation to achieve an overall “best-fit”, and this kind of conformational adjustment resulting in the overall binding is referred to as “induced-fit”.[4]
The focus of molecular docking is to computationally simulate the molecular recognition process. The aim of molecular docking is to achieve an optimized conformation for both the protein and ligand and relative orientation between protein and ligand such that the free energy of the overall system is minimized.
Docking approaches
Two approaches are particularly popular within the molecular docking community. One approach uses a matching technique that describes the protein and the ligand as complementary surfaces.[5][6][7] The second approach simulates the actual docking process in which the ligand-protein pairwise interaction energies are calculated.[8] Both approaches have significant advantages as well as some limitations. These are outlined below.
Shape complementarity
Geometric matching/shape complementarity methods describe the protein and ligand as a set of features that make them dockable.[9] These features may include molecular surface/complementary surface descriptors. In this case, the receptor’s molecular surface is described in terms of its solvent-accessible surface area and the ligand’s molecular surface is described in terms of its matching surface description. The complementarity between the two surfaces amounts to the shape matching description that may help in finding the complementary pose for docking the target and the ligand molecules. Another approach is to describe the hydrophobic features of the protein using turns in the main-chain atoms. Yet another approach is to use a Fourier shape descriptor technique.[10][11][12] Whereas the shape complementarity based approaches are typically fast and robust, they cannot usually model the movements or dynamic changes in the ligand/protein conformations accurately, although recent developments allow these methods to investigate ligand flexibility. Shape complementarity methods can quickly scan through several thousand ligands in a matter of seconds and determine whether they can bind at the protein’s active site, and are usually scalable to even protein-protein interactions. They are also much more amenable to pharmacophore based approaches, since they use geometric descriptions of the ligands to find optimal binding.
Simulation
The simulation of the docking process as such is a much more complicated process. In this approach, the protein and the ligand are separated by some physical distance, and the ligand finds its position into the protein’s active site after a certain number of “moves” in its conformational space. The moves incorporate rigid body transformations such as translations and rotations, as well as internal changes to the ligand’s structure including torsion angle rotations. Each of these moves in the conformation space of the ligand induces a total energetic cost of the system, and hence after every move the total energy of the system is calculated. The obvious advantage of the method is that it is more amenable to incorporating ligand flexibility into its modeling, whereas shape complementarity techniques have to use some ingenious methods to incorporate flexibility in ligands. Another advantage is that the process is physically closer to what happens in reality, when the protein and ligand approach each other after molecular recognition. A clear disadvantage of this technique is that it takes longer to evaluate the optimal binding pose, since a rather large energy landscape has to be explored. However, grid-based techniques as well as fast optimization methods have significantly ameliorated these problems.
Mechanics of docking
To perform a docking screen, the first requirement is a structure of the protein of interest. Usually the structure has been determined using a biophysical technique such as X-ray crystallography, or less often, NMR spectroscopy. This protein structure and a database of potential ligands serve as inputs to a docking program. The success of a docking program depends on two components: the search algorithm and the scoring function.
Search algorithm
The search space in theory consists of all possible orientations and conformations
of the protein paired with the ligand. However in practice with current
computational resources, it is impossible to exhaustively explore the
search space—this would involve enumerating all possible distortions of
each molecule (molecules are dynamic and exist in an ensemble of
conformational states) and all possible rotational and translational orientations of the ligand relative to the protein at a given level of granularity.
Most docking programs in use account for a flexible ligand, and several
attempt to model a flexible protein receptor. Each "snapshot" of the
pair is referred to as a pose.
A variety of conformational search strategies have been applied to the ligand and to the receptor. These include:
- systematic or stochastic torsional searches about rotatable bonds
- molecular dynamics simulations
- genetic algorithms to "evolve" new low energy conformations
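To make the idea of a stochastic search over poses concrete, here is a deliberately simplified Python sketch: a rigid ligand in two dimensions is moved by random translations and rotations, and moves are accepted with a Metropolis criterion. The coordinates, the toy score (squared distance between matched ligand atoms and pocket sites), the step sizes, and the temperature are all assumptions invented for this example; real docking programs work in three dimensions, include torsional moves, and use far more elaborate scoring.

```python
import math
import random

# Hypothetical 2D pocket sites and ligand atoms, for illustration only.
POCKET = [(5.0, 5.0), (6.0, 5.0), (6.0, 6.0)]
LIGAND = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]

def transform(atoms, pose):
    """Apply a rigid-body pose (x shift, y shift, rotation angle) to the atoms."""
    dx, dy, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in atoms]

def toy_score(pose):
    """Toy score: summed squared distance of each ligand atom from its pocket site (lower is better)."""
    placed = transform(LIGAND, pose)
    return sum((px - qx) ** 2 + (py - qy) ** 2
               for (px, py), (qx, qy) in zip(placed, POCKET))

def monte_carlo_search(steps=5000, temperature=0.5, seed=0):
    """Random-walk search over poses with Metropolis acceptance."""
    rng = random.Random(seed)
    pose = (0.0, 0.0, 0.0)
    score = toy_score(pose)
    best_pose, best_score = pose, score
    for _ in range(steps):
        # Propose a small random translation and rotation.
        candidate = (pose[0] + rng.gauss(0, 0.3),
                     pose[1] + rng.gauss(0, 0.3),
                     pose[2] + rng.gauss(0, 0.2))
        cand_score = toy_score(candidate)
        # Always accept improvements; sometimes accept worse poses to escape local minima.
        if cand_score < score or rng.random() < math.exp((score - cand_score) / temperature):
            pose, score = candidate, cand_score
            if score < best_score:
                best_pose, best_score = pose, score
    return best_pose, best_score

best_pose, best_score = monte_carlo_search()
```

Each accepted `pose` corresponds to one "snapshot" in the sense used above. The search never enumerates the space exhaustively; it only samples it.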
Ligand flexibility
Conformations of the ligand may be generated in the absence of the receptor and subsequently docked[13] or conformations may be generated on-the-fly in the presence of the receptor binding cavity,[14] or with full rotational flexibility of every dihedral angle using fragment based docking.[15] Force field energy evaluations are most often used to select energetically reasonable conformations,[16] but knowledge-based methods have also been used.[17]
Receptor flexibility
Computational capacity has increased dramatically over the last decade, making possible the use of more sophisticated and computationally intensive methods in computer-assisted drug design. However, dealing with receptor flexibility in docking methodologies is still a thorny issue. The main reason behind this difficulty is the large number of degrees of freedom that have to be considered in this kind of calculation. Neglecting it, however, leads to poor docking results in terms of binding pose prediction.[18]
Multiple static structures experimentally determined for the same protein in different conformations are often used to emulate receptor flexibility.[19] Alternatively, rotamer libraries of amino acid side chains that surround the binding cavity may be searched to generate alternate but energetically reasonable protein conformations.[20][21]
Scoring function
The scoring function takes a pose as input and returns a number
indicating the likelihood that the pose represents a favorable binding
interaction.
Most scoring functions are physics-based molecular mechanics force fields that estimate the energy of the pose; a low (negative) energy indicates a stable system and thus a likely binding interaction. An alternative approach is to derive a statistical potential for interactions from a large database of protein-ligand complexes, such as the Protein Data Bank, and evaluate the fit of the pose according to this inferred potential.
There are a large number of structures from X-ray crystallography for complexes between proteins and high affinity ligands, but comparatively fewer for low affinity ligands, as the latter complexes tend to be less stable and therefore more difficult to crystallize. Scoring functions trained with this data can dock high affinity ligands correctly, but they will also give plausible docked conformations for ligands that do not bind. This gives a large number of false positive hits, i.e., ligands predicted to bind to the protein that actually don't when placed together in a test tube.
One way to reduce the number of false positives is to recalculate the energy of the top scoring poses using (potentially) more accurate but computationally more intensive techniques such as Generalized Born or Poisson-Boltzmann methods.[8]
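As a minimal sketch of what a physics-based scoring function computes, the Python below sums a Lennard-Jones 12-6 term and a Coulomb term over every ligand-protein atom pair. The single shared `epsilon` and `sigma`, the dielectric value, and the atom tuples are simplifying assumptions for illustration; real force fields use per-atom-type parameters and additional terms such as hydrogen bonding and desolvation.

```python
import math

def pairwise_energy(ligand_atoms, protein_atoms,
                    epsilon=0.2, sigma=3.5, coulomb_k=332.0, dielectric=4.0):
    """Sum Lennard-Jones and Coulomb energies over all ligand-protein atom pairs.

    Each atom is a tuple (x, y, z, partial_charge), with coordinates in angstroms.
    The parameter values are illustrative and not taken from any specific force field.
    """
    total = 0.0
    for lx, ly, lz, lq in ligand_atoms:
        for px, py, pz, pq in protein_atoms:
            r = math.dist((lx, ly, lz), (px, py, pz))
            sr6 = (sigma / r) ** 6
            lennard_jones = 4 * epsilon * (sr6 ** 2 - sr6)   # steric repulsion and dispersion
            coulomb = coulomb_k * lq * pq / (dielectric * r)  # electrostatics
            total += lennard_jones + coulomb
    return total

# Two neutral atoms at the Lennard-Jones minimum distance, and the same pair clashing.
r_min = 3.5 * 2 ** (1 / 6)
relaxed = pairwise_energy([(0.0, 0.0, 0.0, 0.0)], [(r_min, 0.0, 0.0, 0.0)])
clashing = pairwise_energy([(0.0, 0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0, 0.0)])
```

In this toy model a pose with a lower (more negative) total would be ranked as the more likely binding interaction, which matches the convention described above.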
Applications
A binding interaction between a small molecule ligand and an enzyme protein may result in activation or inhibition of the enzyme. If the protein is a receptor, ligand binding may result in agonism or antagonism. Docking is most commonly used in the field of drug design, since most drugs are small organic molecules, and docking may be applied to:
- hit identification – docking combined with a scoring function can be used to quickly screen large databases of potential drugs in silico to identify molecules that are likely to bind to the protein target of interest (see virtual screening).
- lead optimization – docking can be used to predict where and in which relative orientation a ligand binds to a protein (also referred to as the binding mode or pose). This information may in turn be used to design more potent and selective analogs.
- bioremediation – protein-ligand docking can also be used to predict pollutants that can be degraded by enzymes.[22]
- tutorials from rcmd.it - The use of Autodock and Autodock Vina is illustrated in a couple of tutorials prepared by Prof. Rino Ragno @Sapienza University. The tutorials are downloadable from www.rcmd.it
Chemometrics
Chemometrics is the science of extracting information from chemical systems by data-driven means. It is a highly interfacial discipline, using methods frequently employed in core data-analytic disciplines such as multivariate statistics, applied mathematics, and computer science, in order to address problems in chemistry, biochemistry, medicine, biology and chemical engineering. In this way, it mirrors several other interfacial ‘-metrics’ such as psychometrics and econometrics.
Introduction
Chemometrics is applied to solve both descriptive and predictive problems in experimental life sciences, especially in chemistry. In descriptive applications, properties of chemical systems are modeled with the intent of learning the underlying relationships and structure of the system (i.e., model understanding and identification). In predictive applications, properties of chemical systems are modeled with the intent of predicting new properties or behavior of interest. In both cases, the datasets can be small but are often very large and highly complex, involving hundreds to thousands of variables, and hundreds to thousands of cases or observations.
Chemometric techniques are particularly heavily used in analytical chemistry and metabolomics, and the development of improved chemometric methods of analysis also continues to advance the state of the art in analytical instrumentation and methodology. It is an application driven discipline, and thus while the standard chemometric methodologies are very widely used industrially, academic groups are dedicated to the continued development of chemometric theory, method and application development.
Origins
Although one could argue that even the earliest analytical experiments in chemistry involved a form of chemometrics, the field is generally recognized to have emerged in the 1970s as computers became increasingly exploited for scientific investigation. The term ‘chemometrics’ was coined by Svante Wold in a grant application in 1971,[1] and the International Chemometrics Society was formed shortly thereafter by Svante Wold and Bruce Kowalski, two pioneers in the field. Wold was a professor of organic chemistry at Umeå University, Sweden, and Kowalski was a professor of analytical chemistry at University of Washington, Seattle.
Many early applications involved multivariate classification, numerous quantitative predictive applications followed, and by the late 1970s and early 1980s a wide variety of data- and computer-driven chemical analyses were occurring.
Multivariate analysis was a critical facet even in the earliest applications of chemometrics. The data resulting from infrared and UV/visible spectroscopy are often easily numbering in the thousands of measurements per sample. Mass spectrometry, nuclear magnetic resonance, atomic emission/absorption and chromatography experiments are also all by nature highly multivariate. The structure of these data was found to be conducive to using techniques such as principal components analysis (PCA), and partial least-squares (PLS). This is primarily because, while the datasets may be highly multivariate there is strong and often linear low-rank structure present. PCA and PLS have been shown over time very effective at empirically modeling the more chemically interesting low-rank structure, exploiting the interrelationships or ‘latent variables’ in the data, and providing alternative compact coordinate systems for further numerical analysis such as regression, clustering, and pattern recognition. Partial least squares in particular was heavily used in chemometric applications for many years before it began to find regular use in other fields.
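The low-rank structure mentioned above can be made concrete with a small sketch. It builds synthetic "spectra" as mixtures of two hypothetical band shapes plus noise, then runs PCA via the singular value decomposition of the mean-centred data. The band positions, widths, noise level and sample counts are all assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_channels = 30, 500
axis = np.linspace(0.0, 1.0, n_channels)

# Two hypothetical pure-component band shapes (Gaussian peaks).
pure = np.vstack([
    np.exp(-((axis - 0.3) / 0.05) ** 2),
    np.exp(-((axis - 0.7) / 0.08) ** 2),
])
conc = rng.uniform(0.0, 1.0, size=(n_samples, 2))
X = conc @ pure + 1e-3 * rng.standard_normal((n_samples, n_channels))

# PCA by SVD of the mean-centred data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)  # fraction of variance per component
scores = U[:, :2] * s[:2]            # sample coordinates on the first two PCs
loadings = Vt[:2]                    # spectral directions of those PCs

print("variance explained by first two components:", explained[:2].sum())
```

Although each sample has 500 measured channels, nearly all of the variance sits in two components, which is why the compact `scores` matrix can stand in for the full spectra in later regression or clustering.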
Through the 1980s three dedicated journals appeared in the field: Journal of Chemometrics, Chemometrics and Intelligent Laboratory Systems, and Journal of Chemical Information and Modeling. These journals continue to cover both fundamental and methodological research in chemometrics. At present, most routine applications of existing chemometric methods are commonly published in application-oriented journals (e.g., Applied Spectroscopy, Analytical Chemistry, Anal. Chim. Acta., Talanta). Several important books/monographs on chemometrics were also first published in the 1980s, including the first edition of Malinowski’s "Factor Analysis in Chemistry",[2] Sharaf, Illman and Kowalski’s "Chemometrics",[3] Massart et al. "Chemometrics: a textbook",[4] and "Multivariate Calibration" by Martens and Naes.[5]
Some large chemometric application areas have gone on to represent new domains, such as molecular modeling and QSAR, cheminformatics, the ‘-omics’ fields of genomics, proteomics, metabonomics and metabolomics, process modeling and process analytical technology.
An account of the early history of chemometrics was published as a series of interviews by Geladi and Esbensen.[6][7]
Techniques
Multivariate calibration
Many chemical problems and applications of chemometrics involve calibration. The objective is to develop models which can be used to predict properties of interest based on measured properties of the chemical system, such as pressure, flow, temperature, infrared, Raman, NMR spectra and mass spectra. Examples include the development of multivariate models relating 1) multi-wavelength spectral response to analyte concentration, 2) molecular descriptors to biological activity, 3) multivariate process conditions/states to final product attributes. The process requires a calibration or training data set, which includes reference values for the properties of interest for prediction, and the measured attributes believed to correspond to these properties. For case 1), for example, one can assemble data from a number of samples, including concentrations for an analyte of interest for each sample (the reference) and the corresponding infrared spectrum of that sample. Multivariate calibration techniques such as partial least-squares regression or principal component regression (and near countless other methods) are then used to construct a mathematical model that relates the multivariate response (spectrum) to the concentration of the analyte of interest, and such a model can be used to efficiently predict the concentrations of new samples.
Techniques in multivariate calibration are often broadly categorized as classical or inverse methods.[5][8] The principal difference between these approaches is that in classical calibration the models are solved such that they are optimal in describing the measured analytical responses (e.g., spectra) and can therefore be considered optimal descriptors, whereas in inverse methods the models are solved to be optimal in predicting the properties of interest (e.g., concentrations) and can therefore be considered optimal predictors.[9] Inverse methods usually require less physical knowledge of the chemical system, and at least in theory provide superior predictions in the mean-squared error sense,[10][11][12] and hence inverse approaches tend to be more frequently applied in contemporary multivariate calibration.
A main advantage of multivariate calibration techniques is that fast, cheap, or non-destructive analytical measurements (such as optical spectroscopy) can be used to estimate sample properties which would otherwise require time-consuming, expensive or destructive testing (such as HPLC-MS). Equally important is that multivariate calibration allows for accurate quantitative analysis in the presence of heavy interference by other analytes. The selectivity of the analytical method is provided as much by the mathematical calibration as by the analytical measurement modalities. For example, near-infrared spectra, which are extremely broad and non-selective compared to other analytical techniques (such as infrared or Raman spectra), can often be used successfully in conjunction with carefully developed multivariate calibration methods to predict concentrations of analytes in very complex matrices.
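The contrast between classical and inverse calibration described in this section can be sketched on simulated data. The example below is a minimal illustration, assuming two overlapping Gaussian bands, uniformly distributed concentrations and a small noise level; classical least squares stands in for the classical approach and principal component regression for the inverse approach. Note that the classical fit needs the concentrations of both components in the training set, while the inverse fit uses only the analyte of interest.

```python
import numpy as np

rng = np.random.default_rng(1)
axis = np.linspace(0.0, 1.0, 200)
# Two hypothetical, partially overlapping pure-component bands.
pure = np.vstack([
    np.exp(-((axis - 0.4) / 0.10) ** 2),
    np.exp(-((axis - 0.6) / 0.10) ** 2),
])

def simulate(n):
    """Return concentrations (n x 2) and noisy mixture spectra (n x 200)."""
    C = rng.uniform(0.0, 1.0, size=(n, 2))
    X = C @ pure + 1e-3 * rng.standard_normal((n, axis.size))
    return C, X

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

C_train, X_train = simulate(40)
C_test, X_test = simulate(10)

# Classical calibration: model the spectra, X = C S, then invert for C.
S_hat, *_ = np.linalg.lstsq(C_train, X_train, rcond=None)   # estimated pure spectra
C_cls, *_ = np.linalg.lstsq(S_hat.T, X_test.T, rcond=None)  # concentrations of new samples
y_classical = C_cls.T[:, 0]

# Inverse calibration (principal component regression): model c = f(x) directly.
y_train = C_train[:, 0]
x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
_, _, Vt = np.linalg.svd(X_train - x_mean, full_matrices=False)
V = Vt[:2].T                                                # first two loadings
q, *_ = np.linalg.lstsq((X_train - x_mean) @ V, y_train - y_mean, rcond=None)
y_inverse = (X_test - x_mean) @ V @ q + y_mean

rmse_classical = rmse(C_test[:, 0], y_classical)
rmse_inverse = rmse(C_test[:, 0], y_inverse)
print("classical RMSE:", rmse_classical)
print("inverse RMSE:  ", rmse_inverse)
```

On clean simulated data like this both approaches predict well; the practical differences discussed above show up when some interferents are unknown or unmeasured in the calibration set.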
Classification, pattern recognition, clustering
Supervised multivariate classification techniques are closely related to multivariate calibration techniques in that a calibration or training set is used to develop a mathematical model capable of classifying future samples. The techniques employed in chemometrics are similar to those used in other fields – multivariate discriminant analysis, logistic regression, neural networks, regression/classification trees. The use of rank reduction techniques in conjunction with these conventional classification methods is routine in chemometrics, for example discriminant analysis on principal components or partial least squares scores.
Unsupervised classification (also termed cluster analysis) is also commonly used to discover patterns in complex data sets, and again many of the core techniques used in chemometrics are common to other fields such as machine learning and statistical learning.
Multivariate curve resolution
In chemometric parlance, multivariate curve resolution seeks to deconstruct data sets with limited or absent reference information and system knowledge. Some of the earliest work on these techniques was done by Lawton and Sylvestre in the early 1970s.[13][14] These approaches are also called self-modeling mixture analysis, blind source/signal separation, and spectral unmixing. For example, from a data set comprising fluorescence spectra from a series of samples each containing multiple fluorophores, multivariate curve resolution methods can be used to extract the fluorescence spectra of the individual fluorophores, along with their relative concentrations in each of the samples, essentially unmixing the total fluorescence spectrum into the contributions from the individual components. The problem is usually ill-determined due to rotational ambiguity (many possible solutions can equivalently represent the measured data), so the application of additional constraints is common, such as non-negativity, unimodality, or known interrelationships between the individual components (e.g., kinetic or mass-balance constraints).[15][16]
Other techniques
Experimental design remains a core area of study in chemometrics and several monographs are specifically devoted to experimental design in chemical applications.[17][18] Sound principles of experimental design have been widely adopted within the chemometrics community, although many complex experiments are purely observational, and there can be little control over the properties and interrelationships of the samples and sample properties.
Signal processing is also a critical component of almost all chemometric applications, particularly the use of signal pretreatments to condition data prior to calibration or classification. The techniques employed commonly in chemometrics are often closely related to those used in related fields.[19]
Performance characterization and figures of merit
Like most arenas in the physical sciences, chemometrics is quantitatively oriented, so considerable emphasis is placed on performance characterization, model selection, verification & validation, and figures of merit. The performance of quantitative models is usually specified by the root mean squared error in predicting the attribute of interest, and the performance of classifiers as true-positive rate/false-positive rate pairs (or a full ROC curve). A recent report by Olivieri et al. provides a comprehensive overview of figures of merit and uncertainty estimation in multivariate calibration, including multivariate definitions of selectivity, sensitivity, SNR and prediction interval estimation.[20] Chemometric model selection usually involves the use of tools such as resampling (including bootstrap, permutation, cross-validation).
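To show how root mean squared error and cross-validation fit together in model selection, here is a minimal sketch that compares principal component regression models with one and two components by their cross-validated RMSE. The simulated spectra, the five-fold split and the helper names `pcr_predict` and `rmsecv` are assumptions made for this illustration, not a standard API.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pcr_predict(X_train, y_train, X_test, n_components):
    """Fit principal component regression on the training set, predict the test set."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    _, _, Vt = np.linalg.svd(X_train - x_mean, full_matrices=False)
    V = Vt[:n_components].T
    q, *_ = np.linalg.lstsq((X_train - x_mean) @ V, y_train - y_mean, rcond=None)
    return (X_test - x_mean) @ V @ q + y_mean

def rmsecv(X, y, n_components, n_folds=5, seed=0):
    """k-fold cross-validated RMSE: every sample is predicted while held out."""
    idx = np.random.default_rng(seed).permutation(len(y))
    y_hat = np.empty(len(y), dtype=float)
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        y_hat[test] = pcr_predict(X[train], y[train], X[test], n_components)
    return rmse(y, y_hat)

# Simulated two-component mixture spectra (assumed band shapes and noise level).
rng = np.random.default_rng(2)
axis = np.linspace(0.0, 1.0, 200)
pure = np.vstack([np.exp(-((axis - 0.4) / 0.10) ** 2),
                  np.exp(-((axis - 0.6) / 0.10) ** 2)])
C = rng.uniform(0.0, 1.0, size=(30, 2))
X = C @ pure + 1e-3 * rng.standard_normal((30, axis.size))
y = C[:, 0]

cv_one = rmsecv(X, y, n_components=1)
cv_two = rmsecv(X, y, n_components=2)
print("RMSECV, 1 component: ", cv_one)
print("RMSECV, 2 components:", cv_two)
```

Because the data contain two independently varying components, the one-component model cannot separate them and its cross-validated error is much larger; picking the model with the lowest RMSECV is one simple selection rule.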
Multivariate statistical process control (MSPC), modeling and optimization account for a substantial amount of historical chemometric development.[21][22][23] Spectroscopy has been used successfully for online monitoring of manufacturing processes for 30–40 years, and this process data is highly amenable to chemometric modeling. Specifically in terms of MSPC, multiway modeling of batch and continuous processes is increasingly common in industry and remains an active area of research in chemometrics and chemical engineering. Process analytical chemistry, as it was originally termed,[24] or process analytical technology, to use the newer term, continues to draw heavily on chemometric methods and MSPC.
Multiway methods are heavily used in chemometric applications.[25][26] These are higher-order extensions of more widely used methods. For example, while the analysis of a table (matrix, or second-order array) of data is routine in several fields, multiway methods are applied to data sets that involve 3rd, 4th, or higher-orders. Data of this type is very common in chemistry, for example a liquid-chromatography / mass spectrometry (LC-MS) system generates a large matrix of data (elution time versus m/z) for each sample analyzed. The data across multiple samples thus comprises a data cube. Batch process modeling involves data sets that have time vs. process variables vs. batch number. The multiway mathematical methods applied to these sorts of problems include PARAFAC, trilinear decomposition, and multiway PLS and PCA.
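The data cube described above can be sketched with a small synthetic example. The code below builds a three-way array (samples × elution time × m/z) from an assumed trilinear model with two components, the structure that PARAFAC is designed to recover, and then unfolds it into an ordinary matrix of the kind used as input by unfolding-based multiway PCA or PLS. The dimensions and random factor matrices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_times, n_mz, n_components = 6, 8, 10, 2

# Hypothetical factor matrices: per-sample amounts, elution profiles, mass spectra.
A = rng.uniform(0.1, 1.0, size=(n_samples, n_components))
B = rng.uniform(0.1, 1.0, size=(n_times, n_components))
C = rng.uniform(0.1, 1.0, size=(n_mz, n_components))

# Trilinear model: cube[i, j, k] = sum over r of A[i, r] * B[j, r] * C[k, r].
cube = np.einsum("ir,jr,kr->ijk", A, B, C)

# Unfold so that each sample becomes one long row (time and m/z combined).
unfolded = cube.reshape(n_samples, n_times * n_mz)

# Refolding recovers the original cube exactly.
refolded = unfolded.reshape(n_samples, n_times, n_mz)

print("cube shape:    ", cube.shape)
print("unfolded shape:", unfolded.shape)
print("rank of unfolded matrix:", np.linalg.matrix_rank(unfolded))
```

The unfolded matrix has rank equal to the number of underlying components, which is the multiway counterpart of the low-rank structure that PCA exploits in two-way data.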
Chemogenomics
Chemogenomics, or Chemical Genomics, is the systematic screening of targeted chemical libraries of small molecules against individual drug target families (e.g., GPCRs, nuclear receptors, kinases, proteases, etc.) with the ultimate goal of identification of novel drugs and drug targets.[1]
Typically some members of a target library have been well characterized, where both the function has been determined and compounds that modulate the function of those targets (ligands in the case of receptors, inhibitors of enzymes, or blockers of ion channels) have been identified. Other members of the target family may have unknown function with no known ligands and hence are classified as orphan receptors. By identifying screening hits that modulate the activity of the less well characterized members of the target family, the function of these novel targets can be elucidated. Furthermore, the hits for these targets can be used as a starting point for drug discovery.
A common method to construct a targeted chemical library is to include known ligands of at least one and preferably several members of the target family. Since a portion of ligands that were designed and synthesized to bind to one family member will also bind to additional family members, the compounds contained in a targeted chemical library should collectively bind to a high percentage of the target family.
Chemoinformatics
Cheminformatics (also known as chemoinformatics and chemical informatics) is the use of computer and informational techniques applied to a range of problems in the field of chemistry. These in silico techniques are used, for example, by pharmaceutical companies in the process of drug discovery. These methods can also be used in chemical and allied industries in various other forms.
History
The term chemoinformatics was defined by F.K. Brown [1][2] in 1998:
"Chemoinformatics is the mixing of those information resources to transform data into information and information into knowledge for the intended purpose of making better decisions faster in the area of drug lead identification and optimization."
Since then, both spellings have been used, and some have evolved to be established as Cheminformatics,[3] while European Academia settled in 2006 for Chemoinformatics.[4] The recent establishment of the Journal of Cheminformatics is a strong push towards the shorter variant.
Basics
Cheminformatics combines the scientific working fields of chemistry, computer science and information science, for example in the areas of topology, chemical graph theory, information retrieval and data mining in the chemical space.[5][6][7] Cheminformatics can also be applied to data analysis for various industries such as paper and pulp, dyes and allied industries.
Applications
Storage and retrieval
Main article: Chemical data and databases
The primary application of cheminformatics is in the storage, indexing and search of information relating to compounds. The efficient search of such stored information includes topics that are dealt with in computer science, such as data mining, information retrieval, information extraction and machine learning. Related research topics include:
File formats
Main article: Chemical file format
The in silico representation of chemical structures uses specialized formats such as the XML-based Chemical Markup Language or SMILES. These representations are often used for storage in large chemical databases.
While some formats are suited for visual representations in 2 or 3 dimensions, others are more suited for studying physical interactions, modeling and docking studies.
Virtual libraries
Chemical data can pertain to real or virtual molecules. Virtual libraries of compounds may be generated in various ways to explore chemical space and hypothesize novel compounds with desired properties.
Virtual libraries of classes of compounds (drugs, natural products, diversity-oriented synthetic products) were recently generated using the FOG (fragment optimized growth) algorithm.[8] This was done by using cheminformatic tools to train transition probabilities of a Markov chain on authentic classes of compounds, and then using the Markov chain to generate novel compounds that were similar to the training database.
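The general idea of training Markov chain transition probabilities on known compounds and then sampling new ones can be sketched in a few lines. This is not the FOG algorithm itself: it is a toy first-order chain over made-up fragment sequences, with no chemical validity checks, and the fragment names, the `START`/`END` markers and the three "training compounds" are all hypothetical.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical markers for the beginning and end of a sequence

def train_transitions(sequences):
    """Estimate first-order transition probabilities from token sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = [START, *seq, END]
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return {
        current: {tok: n / sum(nxt.values()) for tok, n in nxt.items()}
        for current, nxt in counts.items()
    }

def generate(transitions, rng, max_len=20):
    """Sample a new token sequence by walking the chain from START to END."""
    out, state = [], START
    while len(out) < max_len:
        options = transitions[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        if state == END:
            break
        out.append(state)
    return out

# Made-up "compounds" written as ordered lists of fragment labels.
training = [
    ["phenyl", "amide", "methyl"],
    ["phenyl", "ether", "ethyl"],
    ["pyridyl", "amide", "ethyl"],
]
transitions = train_transitions(training)
sample = generate(transitions, random.Random(0))
print(transitions[START])
print(sample)
```

Sequences sampled this way reuse the transitions seen in the training set, so they resemble it while also allowing combinations that never appeared together, such as a pyridyl start followed by an amide and a methyl end.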
Virtual screening
Main article: Virtual screening
In contrast to high-throughput screening, virtual screening involves computationally screening in silico libraries of compounds, by means of various methods such as docking, to identify members likely to possess desired properties such as biological activity against a given target. In some cases, combinatorial chemistry is used in the development of the library to increase the efficiency in mining the chemical space. More commonly, a diverse library of small molecules or natural products is screened.
Quantitative structure-activity relationship (QSAR)
Main article: Quantitative structure-activity relationship
This is the calculation of quantitative structure-activity relationship and quantitative structure property relationship values, used to predict the activity of compounds from their structures. In this context there is also a strong relationship to chemometrics. Chemical expert systems are also relevant, since they represent parts of chemical knowledge as an in silico representation.
Bioinformatics
In biology, bioinformatics (/ˌbaɪ.oʊˌɪnfərˈmætɪks/) is an interdisciplinary field that develops and improves upon methods for storing, retrieving, organizing and analyzing biological data. A major activity in bioinformatics is to develop software tools to generate useful biological knowledge.
Bioinformatics has become an important part of many areas of biology. In experimental molecular biology, bioinformatics techniques such as image and signal processing allow extraction of useful results from large amounts of raw data. In the field of genetics and genomics, it aids in sequencing and annotating genomes and their observed mutations. It plays a role in the textual mining of biological literature and the development of biological and gene ontologies to organize and query biological data. It plays a role in the analysis of gene and protein expression and regulation. Bioinformatics tools aid in the comparison of genetic and genomic data and more generally in the understanding of evolutionary aspects of molecular biology. At a more integrative level, it helps analyze and catalogue the biological pathways and networks that are an important part of systems biology. In structural biology, it aids in the simulation and modeling of DNA, RNA, and protein structures as well as molecular interactions.
Bioinformatics uses many areas of computer science, mathematics and engineering to process biological data. Complex machines are used to read in biological data at a much faster rate than before. Databases and information systems are used to store and organize biological data. Analyzing biological data may involve algorithms in artificial intelligence, soft computing, data mining, image processing, and simulation. The algorithms in turn depend on theoretical foundations such as discrete mathematics, control theory, system theory, information theory, and statistics. Commonly used software tools and technologies in the field include Java, C#, XML, Perl, C, C++, Python, R, SQL, CUDA, MATLAB, and spreadsheet applications.
Introduction
History
Building on the recognition of the importance of information transmission, accumulation and processing in biological systems, Paulien Hogeweg coined the term "Bioinformatics" in 1970 to refer to the study of information processes in biotic systems.[4][5][6] This definition placed bioinformatics as a field parallel to biophysics or biochemistry (biochemistry is the study of chemical processes in biological systems).[4] Examples of relevant biological information processes studied in the early days of bioinformatics are the formation of complex social interaction structures by simple behavioral rules, and the information accumulation and maintenance in models of prebiotic evolution.
One early contributor to bioinformatics was Elvin A. Kabat, who pioneered biological sequence analysis in 1970 with his comprehensive volumes of antibody sequences released with Tai Te Wu between 1980 and 1991.[7] Another significant pioneer in the field was Margaret Oakley Dayhoff, who has been hailed by David Lipman, director of the National Center for Biotechnology Information, as the "mother and father of bioinformatics."[8]
At the beginning of the "genomic revolution", the term bioinformatics was re-discovered to refer to the creation and maintenance of a database to store biological information such as nucleotide sequences and amino acid sequences. Development of this type of database involved not only design issues but the development of complex interfaces whereby researchers could access existing data as well as submit new or revised data.
Goals
In order to study how normal cellular activities are altered in different disease states, the biological data must be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data. This includes nucleotide and amino acid sequences, protein domains, and protein structures.[9] The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:
- the development and implementation of tools that enable efficient access to, use and management of, various types of information.
- the development of new algorithms (mathematical formulas) and statistics with which to assess relationships among members of large data sets. For example, methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences.
Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.
Over the past few decades rapid developments in genomic and other molecular research technologies and developments in information technologies have combined to produce a tremendous amount of information related to molecular biology. Bioinformatics is the name given to these mathematical and computing approaches used to glean understanding of biological processes.
Approaches
Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning different DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures.
There are two fundamental ways of modelling a biological system (e.g., a living cell), both of which come under bioinformatics approaches:
- Static
  - Sequences – Proteins, Nucleic acids and Peptides
  - Interaction data among the above entities including microarray data and Networks of proteins, metabolites
- Dynamic
  - Structures – Proteins, Nucleic acids, Ligands (including metabolites and drugs) and Peptides (structures studied with bioinformatics tools are not considered static anymore and their dynamics is often the core of the structural studies)
  - Systems Biology comes under this category including reaction fluxes and variable concentrations of metabolites
  - Multi-Agent Based modelling approaches capturing cellular events such as signalling, transcription and reaction dynamics
Major research areas
Sequence analysis
Main articles: Sequence alignment and Sequence database
Since the Phage Φ-X174 was sequenced in 1977,[10] the DNA sequences of thousands of organisms have been decoded and stored in databases. This sequence information is analyzed to determine genes that encode polypeptides (proteins), RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. Today, computer programs such as BLAST are used daily to search sequences from more than 260 000 organisms, containing over 190 billion nucleotides.[11] These programs can compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, to identify sequences that are related, but not identical. A variant of this sequence alignment is used in the sequencing process itself. The so-called shotgun sequencing technique (which was used, for example, by The Institute for Genomic Research to sequence the first bacterial genome, Haemophilus influenzae)[12] does not produce entire chromosomes. Instead it generates the sequences of many thousands of small DNA fragments (ranging from 35 to 900 nucleotides long, depending on the sequencing technology). The ends of these fragments overlap and, when aligned properly by a genome assembly program, can be used to reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. For a genome as large as the human genome, it may take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly will usually contain numerous gaps that have to be filled in later. Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.
Another aspect of bioinformatics in sequence analysis is annotation. This involves computational gene finding to search for protein-coding genes, RNA genes, and other functional sequences within a genome. Not all of the nucleotides within a genome are part of genes. Within the genomes of higher organisms, large parts of the DNA do not serve any obvious purpose. This so-called junk DNA may, however, contain unrecognized functional elements. Bioinformatics helps to bridge the gap between genome and proteome projects, for example in the use of DNA sequences for protein identification.
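The idea of comparing sequences while tolerating exchanged, deleted or inserted bases can be illustrated with a small dynamic-programming sketch. This is not how BLAST works internally (BLAST relies on heuristic local alignment with scoring schemes); it is only a minimal edit-distance calculation, and the function name and the example sequences are chosen here for illustration.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to turn sequence a into sequence b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty string takes i deletions
        for j, base_b in enumerate(b, start=1):
            cost = 0 if base_a == base_b else 1
            curr.append(min(
                prev[j] + 1,         # delete base_a
                curr[j - 1] + 1,     # insert base_b
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical example sequences: one exchanged base, then one deleted base.
one_substitution = edit_distance("GATTACA", "GACTACA")
one_deletion = edit_distance("GATTACA", "GATACA")
```

A small distance relative to the sequence length suggests that two sequences are related but not identical; real tools add scoring matrices, gap penalties and indexing to make this practical on large databases.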
Genome annotation
Main article: Gene finding
In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. The first genome annotation software system was designed in 1995 by Owen White, who was part of the team at The Institute for Genomic Research that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae. White built a software system to find the genes (fragments of genomic sequence that encode proteins), the transfer RNAs, and to make initial assignments of function to those genes. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA, such as the GeneMark program trained and used to find protein-coding genes in Haemophilus influenzae, are constantly changing and improving.
Computational evolutionary biology
Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists in several key ways; it has enabled researchers to:
- trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone,
- more recently, compare entire genomes, which permits the study of more complex evolutionary events, such as gene duplication, horizontal gene transfer, and the prediction of factors important in bacterial speciation,
- build complex computational models of populations to predict the outcome of the system over time,[13]
- track and share information on an increasingly large number of species and organisms.
The area of research within computer science that uses genetic algorithms is sometimes confused with computational evolutionary biology, but the two areas are not necessarily related.
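As a rough illustration of tracing evolution by measuring changes in DNA, the sketch below computes the proportion of differing positions (often called the p-distance) between sequences that are assumed to be already aligned and of equal length. The species labels and sequences are invented for the example, and real studies use more elaborate substitution models.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    differences = sum(x != y for x, y in zip(a, b))
    return differences / len(a)


def distance_matrix(sequences):
    """Pairwise p-distances for a dict mapping a label to an aligned sequence."""
    return {
        (first, second): p_distance(sequences[first], sequences[second])
        for first, second in combinations(sorted(sequences), 2)
    }


# Hypothetical aligned fragments from three organisms.
example = {"species_a": "ACGT", "species_b": "ACGA", "species_c": "TCGA"}
distances = distance_matrix(example)
```

A matrix like this is a typical input to distance-based tree-building methods, where smaller distances are taken as evidence of more recent common ancestry.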
Literature analysis
Main article: Text mining
The growth in the volume of published literature makes it virtually impossible to read every paper, resulting in disjointed sub-fields of research. Literature analysis aims to employ computational and statistical linguistics to mine this growing library of text resources. For example:
- abbreviation recognition – identifying the long form and abbreviation of biological terms,
- named entity recognition – recognizing biological terms such as gene names,
- protein-protein interaction – identifying which proteins interact with which other proteins from text.
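Abbreviation recognition, the first task in the list above, can be sketched with a simple rule: find an upper-case abbreviation in parentheses and check whether the initials of the preceding words spell it out. The stop-word list, the function name and the sample sentence are assumptions made for this example; production text-mining systems use far more robust methods.

```python
import re

# Short words that are usually skipped when forming an abbreviation (assumed list).
STOPWORDS = {"of", "the", "and", "in", "for"}


def find_abbreviations(text):
    """Return (long form, abbreviation) pairs found as 'long form (ABBR)'."""
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        i = len(words) - 1
        matched = True
        # Walk backwards, matching each letter to the initial of a word.
        for letter in reversed(abbr):
            while i >= 0 and words[i].lower() in STOPWORDS:
                i -= 1
            if i < 0 or words[i][0].upper() != letter:
                matched = False
                break
            i -= 1
        if matched:
            pairs.append((" ".join(words[i + 1:]), abbr))
    return pairs


sample = "We used serial analysis of gene expression (SAGE) and mass spectrometry (MS)."
found = find_abbreviations(sample)
```

The rule deliberately fails on abbreviations that are not built from word initials, which is one reason statistical approaches are preferred in practice.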
Analysis of gene expression
The expression of many genes can be determined by measuring mRNA levels with multiple techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), RNA-Seq, also known as "Whole Transcriptome Shotgun Sequencing" (WTSS), or various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies. Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.
Analysis of regulation
Regulation is the complex orchestration of events starting with an extracellular signal such as a hormone and leading to an increase or decrease in the activity of one or more proteins. Bioinformatics techniques have been applied to explore various steps in this process. For example, promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single-cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements. Examples of clustering algorithms applied in gene clustering are k-means clustering, self-organizing maps (SOMs), hierarchical clustering, and consensus clustering methods such as the Bi-CoPaM. The latter, Bi-CoPaM, was proposed specifically to address issues that arise in gene discovery problems, such as consistent co-expression of genes over multiple microarray datasets.[14][15]
Analysis of protein expression
Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data; the former approach faces problems similar to those of microarrays targeted at mRNA, while the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples where multiple, but incomplete, peptides from each protein are detected.
Analysis of mutations in cancer
In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome sequences and germline polymorphisms. New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and single-nucleotide polymorphism arrays to detect known point mutations. These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high-throughput to measure thousands of samples, generate terabytes of data per experiment. Again the massive amounts and new types of data generate new opportunities for bioinformaticians. The data is often found to contain considerable variability, or noise, and thus hidden Markov model and change-point analysis methods are being developed to infer real copy number changes.
Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent among many tumors.
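To make the idea of change-point analysis on noisy copy number data more concrete, the following is a minimal sketch that looks for a single change point by choosing the split that minimizes the within-segment squared error. It is a simplification assumed for illustration (one change point, least-squares criterion, invented probe values), not one of the published methods referred to above.

```python
def best_change_point(values):
    """Index k that splits values into values[:k] and values[k:]
    with the smallest total within-segment squared error."""
    if len(values) < 2:
        raise ValueError("need at least two measurements")

    def squared_error(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    return min(
        range(1, len(values)),
        key=lambda k: squared_error(values[:k]) + squared_error(values[k:]),
    )


# Hypothetical log-ratio signal along a chromosome: a gain starts at index 4.
signal = [0.0, 0.1, -0.1, 0.0, 1.0, 0.9, 1.1, 1.0]
change_at = best_change_point(signal)
```

Practical methods extend this idea to multiple change points and add statistical tests to decide whether a detected shift reflects a real copy number change or just noise.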
Comparative genomics
Main article: Comparative genomics
The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms. It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models to Markov Chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on homology detection and the computation of protein families.
Network and systems biology
Network analysis seeks to understand the relationships within biological networks such as metabolic or protein-protein interaction networks. Although biological networks can be constructed from a single type of molecule or entity (such as genes), network biology often attempts to integrate many different data types, such as proteins, small molecules, gene expression data, and others, which are all connected physically and/or functionally.
Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes which comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes. Artificial life or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.
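A very small example of network analysis is given below: an interaction network is stored as an adjacency structure, and connected groups of proteins are found with a breadth-first search. The protein identifiers and interactions are hypothetical, and the function names are chosen only for this sketch.

```python
from collections import defaultdict, deque


def build_network(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    adjacency = defaultdict(set)
    for first, second in interactions:
        adjacency[first].add(second)
        adjacency[second].add(first)
    return adjacency


def connected_components(adjacency):
    """Group nodes that are linked directly or indirectly."""
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical protein-protein interactions.
network = build_network([("P1", "P2"), ("P2", "P3"), ("P4", "P5")])
groups = connected_components(network)
degree_of_p2 = len(network["P2"])  # number of direct interaction partners
```

Measures such as node degree and component membership are common starting points before integrating further data types into the network.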
High-throughput image analysis
Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical imagery. Modern image analysis systems augment an observer's ability to make measurements from a large or complex set of images, by improving accuracy, objectivity, or speed. A fully developed analysis system may completely replace the observer. Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research. Some examples are:
- high-throughput and high-fidelity quantification and sub-cellular localization (high-content screening, cytohistopathology, Bioimage informatics)
- morphometrics
- clinical image analysis and visualization
- determining the real-time air-flow patterns in breathing lungs of living animals
- quantifying occlusion size in real-time imagery from the development of and recovery during arterial injury
- making behavioral observations from extended video recordings of laboratory animals
- infrared measurements for metabolic activity determination
- inferring clone overlaps in DNA mapping, e.g. the Sulston score
Structural bioinformatic approaches
Prediction of protein structure
Main article: Protein structure prediction
Protein structure prediction is another important application of bioinformatics. The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence on the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a structure in its native environment. (Of course, there are exceptions, such as the bovine spongiform encephalopathy – a.k.a. Mad Cow Disease – prion.) Knowledge of this structure is vital in understanding the function of the protein. For lack of better terms, structural information is usually classified as one of secondary, tertiary and quaternary structure. A viable general solution to such predictions remains an open problem. Most efforts have so far been directed towards heuristics that work most of the time.
One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably.
One example of this is the homology between hemoglobin in humans and the hemoglobin in legumes (leghemoglobin). Both serve the same purpose of transporting oxygen in the organism. Though these two proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near-identical purposes.
Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling.
See also: structural motif and structural domain.
Molecular Interaction
Efficient software is available today for studying interactions among proteins, ligands and peptides. The types of interaction most often encountered in the field are protein–ligand (including drug), protein–protein and protein–peptide.
Molecular dynamics simulation of the movement of atoms about rotatable bonds is the fundamental principle behind the computational algorithms, termed docking algorithms, used for studying molecular interactions.
See also: protein–protein interaction prediction.
Docking algorithms
Main article: Protein–protein docking
In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR). One central question for the biological scientist is whether it is practical to predict possible protein–protein interactions only based on these 3D shapes, without doing protein–protein interaction experiments. A variety of methods have been developed to tackle the protein–protein docking problem, though it seems that there is still much work to be done in this field.
Software and tools
Software tools for bioinformatics range from simple command-line tools, to more complex graphical programs and standalone web-services available from various bioinformatics companies or public institutions.
Open-source bioinformatics software
Many free and open-source software tools have existed and continued to grow since the 1980s.[16] The combination of a continued need for new algorithms for the analysis of emerging types of biological readouts, the potential for innovative in silico experiments, and freely available open code bases has helped to create opportunities for all research groups to contribute to both bioinformatics and the range of open-source software available, regardless of their funding arrangements. The open source tools often act as incubators of ideas, or community-supported plug-ins in commercial applications. They may also provide de facto standards and shared object models for assisting with the challenge of bioinformation integration.
The range of open-source software packages includes titles such as Bioconductor, BioPerl, Biopython, BioJava, BioRuby, Bioclipse, EMBOSS, Taverna workbench, and UGENE. In order to maintain this tradition and create further opportunities, the non-profit Open Bioinformatics Foundation[16] has supported the annual Bioinformatics Open Source Conference (BOSC) since 2000.[17]
Web services in bioinformatics
SOAP- and REST-based interfaces have been developed for a wide variety of bioinformatics applications, allowing an application running on one computer in one part of the world to use algorithms, data and computing resources on servers in other parts of the world. The main advantages derive from the fact that end users do not have to deal with software and database maintenance overheads.
Basic bioinformatics services are classified by the EBI into three categories: SSS (Sequence Search Services), MSA (Multiple Sequence Alignment), and BSA (Biological Sequence Analysis).[citation needed] The availability of these service-oriented bioinformatics resources demonstrates the applicability of web-based bioinformatics solutions, which range from a collection of standalone tools with a common data format under a single, standalone or web-based interface, to integrative, distributed and extensible bioinformatics workflow management systems.
Bioinformatics workflow management systems
Main article: Bioinformatics workflow management systems
A bioinformatics workflow management system is a specialized form of a workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or a workflow, in a bioinformatics application. Such systems are designed to:
- provide an easy-to-use environment for individual application scientists themselves to create their own workflows
- provide interactive tools for the scientists enabling them to execute their workflows and view their results in real-time
- simplify the process of sharing and reusing workflows between the scientists.
- enable scientists to track the provenance of the workflow execution results and the workflow creation steps.
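The core ideas in this list, composing steps into a workflow and tracking the provenance of results, can be sketched in a few lines. The function below is a hypothetical illustration rather than the interface of any particular workflow system: it runs named steps in order and records what each step received and produced.

```python
def run_workflow(steps, data):
    """Run (name, function) steps in order and record provenance.

    Returns the final result and a list describing every step's
    input and output, so the result can be traced back later.
    """
    provenance = []
    for name, function in steps:
        result = function(data)
        provenance.append({"step": name, "input": data, "output": result})
        data = result
    return data, provenance


# Hypothetical two-step workflow on a short sequence string.
steps = [
    ("uppercase", str.upper),
    ("reverse", lambda sequence: sequence[::-1]),
]
final_result, history = run_workflow(steps, "acgt")
```

Because the workflow is just a list of named steps, it can be shared and reused, and the recorded history plays the role of a simple provenance log.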
Rosalind
Rosalind is an educational resource and web project for learning bioinformatics through problem solving and computer programming.[18][19][20][21][22] Rosalind users learn bioinformatics concepts through a problem tree that builds up biological, algorithmic, and programming knowledge concurrently. Each problem is checked automatically, allowing for the project to also be used for automated homework testing in existing classes.
Rosalind is a joint project between the University of California at San Diego and Saint Petersburg Academic University along with the Russian Academy of Sciences. The project's name commemorates Rosalind Franklin, whose X-ray crystallography with Raymond Gosling facilitated the discovery of the DNA double helix by James D. Watson and Francis Crick. It was recognized by Homolog.us as the Best Educational Resource of 2012 in their review of the Top Bioinformatics Contributions of 2012. As of May 2013, it hosts over 6,000 active users.
Rosalind will be used to teach the first Bioinformatics Algorithms course on Coursera in 2013.[23]
Conferences
There are several large conferences that are concerned with bioinformatics. Some of the most notable examples are Intelligent Systems for Molecular Biology (ISMB), European Conference on Computational Biology (ECCB), Research in Computational Molecular Biology (RECOMB) and American Society for Mass Spectrometry (ASMS).