Saturday, 29 June 2013

Clinical Research: An Introduction

INTRODUCTION

We define a clinical trial as a prospective study comparing the effect and value of intervention(s) against a control in human beings. Note that a clinical trial is prospective rather than retrospective: study participants must be followed forward in time. Each participant must be followed from a well-defined point in time, which becomes time zero, or baseline, for the study.
A clinical trial must employ one or more intervention techniques. These may be single or combinations of diagnostic, preventive, or therapeutic drugs, biologics, devices, regimens, or procedures. Intervention techniques should be applied to participants in a standard fashion in an effort to change some aspect of the participants. Follow-up of people over a period of time without active intervention may measure the natural history of a disease process, but it does not constitute a clinical trial. Without active intervention the study is observational, because no experiment is being performed. Early phase studies may be controlled or uncontrolled. Common terminology refers to phase I and phase II “trials”, but because these studies are sometimes uncontrolled, they do not always meet the definition used here.
A trial contains a control group against which the intervention group is compared. At baseline, the control group must be sufficiently similar in relevant respects to the intervention group so that differences in outcome may reasonably be attributed to the action of the intervention. Most often a new intervention is compared with, or used along with, the best current standard therapy.
Each clinical trial must incorporate participant safety considerations into its basic design. Equally important is the need for, and responsibility of, the investigator to fully inform potential participants about the trial, including information about potential benefits, harms, and treatment alternatives. Unlike in animal studies, in clinical trials the investigator cannot dictate what an individual should do. The investigator can only strongly encourage participants to avoid certain medications or procedures which might interfere with the trial. Since it may be impossible to have “pure” intervention and control groups, an investigator may not be able to compare interventions, but only intervention strategies. Strategies refer to attempts at getting all participants to adhere, to the best of their ability, to their originally assigned intervention. When planning a trial, the investigator should recognize the difficulties inherent in studies with human subjects and attempt to estimate the magnitude of participants’ failure to adhere strictly to the protocol.
The ideal clinical trial is one that is randomized and double-blind. Deviation from this standard has potential drawbacks. In some clinical trials, compromise is unavoidable, but often deficiencies can be prevented by adhering to fundamental features of design, conduct, and analysis.

Thursday, 20 June 2013

Fundamentals of Trial Design: Randomized Controlled Trials

INTRODUCTION

Randomized clinical trials are scientific investigations that examine and evaluate the safety and efficacy of new drugs or therapeutic procedures using human subjects. The results that these studies generate are considered to be the most valued data in the era of evidence-based medicine. Understanding the principles behind clinical trials enables an appreciation of the validity and reliability of their results.

What is a randomized clinical trial?

A clinical trial evaluates the effect of a new drug (or device or procedure) on human volunteers. These trials can be used to evaluate the safety of a new drug in healthy human volunteers, or to assess treatment benefits in patients with a specific disease. Clinical trials can compare a new drug against existing drugs or against dummy medications (placebo), or they may not have a comparison arm. A large proportion of clinical trials are sponsored by the pharmaceutical or biotechnology companies that are developing the new drug, but some studies using older drugs in new disease areas are funded by health-related government agencies or through charitable grants.
In a blinded randomized clinical trial, patients and trial personnel are deliberately kept unaware of which patient is on the new drug. This minimizes bias in the later evaluation, so that the initial blind random allocation of patients to one or other treatment group is preserved throughout the trial. Clinical trials must be designed in an ethical manner so that patients are not denied the benefit of usual treatments. Patients must give voluntary consent, indicating that they appreciate the purpose of the trial. Several key guidelines regarding the ethics, conduct, and reporting of clinical trials have been constructed to ensure that a patient’s rights and safety are not compromised by participating in clinical trials.

Are there different types of clinical trials?

Clinical trials vary depending on who is conducting the trial. Pharmaceutical companies typically conduct trials involving new drugs or established drugs in disease areas where their drug may gain a new license. Device manufacturers use trials to prove the safety and efficacy of their new device. Clinical trials conducted by clinical investigators unrelated to pharmaceutical companies might have other aims. They might use established or older drugs in new disease areas, often without commercial support, given that older drugs are unlikely to generate much profit. Clinical investigators might also:
  • look at the best way to give or withdraw drugs
  • investigate the best duration of treatment to maximize outcome
  • assess the benefits of prevention with vaccination or screening programs
Thus, different types of trials are needed to cover these needs; these can be classified under the following headings:
Phases:
The pharmaceutical industry has adopted a specific trial classification based on the four clinical phases of development of a particular drug (Phases I–IV). In Phase I, manufacturers usually test the effects of a new drug in healthy volunteers or patients unresponsive to usual therapies. They look at how the drug is handled in the human body (pharmacokinetics/pharmacodynamics), particularly with respect to the immediate short-term safety of higher doses. Clinical trials in Phase II examine dose–response curves in patients and what benefits might be seen in a small group of patients with a particular disease. In Phase III, a new drug is tested in a controlled fashion in a large patient population against a placebo or standard therapy. This is a key phase, where a drug will either make or break its reputation with respect to safety and efficacy before marketing begins. A positive study in Phase III is often known as a landmark study for a drug, through which it might gain a license to be prescribed for a specific disease. A study in Phase IV is often called a post-marketing study as the drug has already been granted regulatory approval/license. These studies are crucial for gathering additional safety information from a larger group of patients in order to understand the long-term safety of the drug and appreciate drug interactions.
Trial design:
Trials can be further classified by design. This classification is more descriptive in terms of how patients are randomized to treatment. The most common design is the parallel-group trial: patients are randomized to the new treatment or to the standard treatment and followed up to determine the effect of each treatment in parallel groups. Other trial designs include, amongst others, crossover trials, factorial trials, and cluster randomized trials.
Crossover trials randomize patients to different sequences of treatments, but all patients eventually get all treatments in varying order, i.e., the patient is his/her own control. Factorial trials assign patients to more than one treatment-comparison group. These are randomized in one trial at the same time, i.e., while drug A is being tested against placebo, patients are re-randomized to drug B or placebo, making four possible treatment combinations in total. Cluster randomized trials are performed when larger groups (e.g., patients of a single practitioner or hospital) are randomized instead of individual patients.
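The four designs described above differ mainly in what is randomized. The following is a minimal Python sketch, assuming simple (unrestricted) randomization; the function names, patient labels, and clinic labels are hypothetical and not taken from any trial software.

```python
import random
from itertools import permutations, product

def parallel_allocation(patients, arms, rng):
    """Parallel-group design: each patient is randomized to exactly one arm."""
    return {p: rng.choice(arms) for p in patients}

def crossover_allocation(patients, treatments, rng):
    """Crossover design: every patient gets every treatment, in a random order."""
    sequences = list(permutations(treatments))
    return {p: rng.choice(sequences) for p in patients}

def factorial_allocation(patients, factors, rng):
    """Factorial design: one independent randomization per factor."""
    return {p: tuple(rng.choice(levels) for levels in factors) for p in patients}

def cluster_allocation(clusters, arms, rng):
    """Cluster design: the cluster is randomized; its patients share the arm."""
    allocation = {}
    for members in clusters.values():
        arm = rng.choice(arms)
        for p in members:
            allocation[p] = arm
    return allocation

rng = random.Random(2013)  # fixed seed so the illustration is reproducible
patients = [f"P{i:02d}" for i in range(1, 9)]

parallel = parallel_allocation(patients, ["new", "standard"], rng)
crossover = crossover_allocation(patients, ["A", "B"], rng)

factors = [("drug A", "placebo A"), ("drug B", "placebo B")]
factorial = factorial_allocation(patients, factors, rng)
factorial_cells = list(product(*factors))  # the four treatment combinations

clusters = {"clinic 1": patients[:4], "clinic 2": patients[4:]}
cluster = cluster_allocation(clusters, ["new", "standard"], rng)
```

Real trials usually restrict the randomization (for example with blocks or stratification) to keep the groups balanced; that is omitted here to keep the contrast between designs visible.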


Number of centers:
Clinical trials can also be classified as single-center or multicenter studies according to the number of sites involved. While single-center studies are mainly used for Phase I and II studies, multicenter studies can be carried out at any stage of clinical development. Multicenter studies are necessary for two major reasons:

  • to evaluate a new medication or procedure more efficiently in terms of accruing sufficient subjects over a shorter period of time
  • to provide a better basis for the subsequent generalization of the trial’s findings, i.e., the effects of the treatment are evaluated in many types of centers.

CLINICAL TRIAL PROTOCOL DEVELOPMENT AND KEY COMPONENTS OF TRIAL PROTOCOL


CLINICAL TRIAL PROTOCOL DEVELOPMENT

Once a clinical question has been postulated, the first step in the conception of a clinical trial to answer that question is to develop a trial protocol. A well-designed protocol reflects the scientific and methodological integrity of a trial. Protocol development has evolved in a complex way over the last 20 years to reflect the care and attention given to undertaking clinical experiments with human volunteers, reflecting the high standards of safety and ethics involved as well as the complex statistical issues.

Questions addressed by a protocol:
  • What is the clinical question being asked by the trial?
  • How should it be answered, in compliance with the standard ethical and regulatory requirements?
  • What analyses should be performed in order to produce meaningful results?
  • How will the results be presented?

Qualities of a good protocol:
  • Clear, comprehensive, easy to navigate, and unambiguous.
  • Designed in accordance with the current principles of Good Clinical Practice and other regulatory requirements.
  • Gives a sound scientific background of the trial.
  • Clearly identifies the benefits and risks of being recruited into the trial.
  • Plainly describes trial methodology and practicalities.
  • Ensures that the rights, safety, and well-being of trial participants are not unduly compromised.
  • Gives enough relevant information to make the trial and its results reproducible.
  • Indicates all features that assure the quality of every aspect of the trial.


CLINICAL TRIAL PROTOCOL

The contents of a trial protocol should generally include the following topics. However, site-specific information may be provided on separate protocol page(s), or addressed in a separate agreement, and some of the information listed below may be contained in other protocol-referenced documents, such as an Investigator’s Brochure.

General Information:
  • Protocol title, protocol identifying number, and date. Any amendment(s) should also bear the amendment number(s) and date(s).
  • Name and address of the sponsor and monitor (if other than the sponsor).
  • Name and title of the person(s) authorized to sign the protocol and the protocol amendment(s) for the sponsor.
  • Name, title, address, and telephone number(s) of the sponsor's medical expert (or dentist when appropriate) for the trial.
  • Name and title of the investigator(s) who is (are) responsible for conducting the trial, and the address and telephone number(s) of the trial site(s).
  • Name, title, address, and telephone number(s) of the qualified physician (or dentist, if applicable), who is responsible for all trial-site related medical (or dental) decisions (if other than investigator).
  • Name(s) and address(es) of the clinical laboratory(ies) and other medical and/or technical department(s) and/or institutions involved in the trial.


Background Information:
  • Name and description of the investigational product(s).
  • A summary of findings from nonclinical studies that potentially have clinical significance and from clinical trials that are relevant to the trial.
  • Summary of the known and potential risks and benefits, if any, to human subjects.
  • Description of and justification for the route of administration, dosage, dosage regimen, and treatment period(s).
  • A statement that the trial will be conducted in compliance with the protocol, GCP and the applicable regulatory requirement(s).
  • Description of the population to be studied.
  • References to literature and data that are relevant to the trial, and that provide background for the trial.

Trial Objectives and Purpose:
  • A detailed description of the objectives and the purpose of the trial.


Trial Design:
The scientific integrity of the trial and the credibility of the data from the trial depend substantially on the trial design. A description of the trial design should include:

  • A specific statement of the primary endpoints and the secondary endpoints, if any, to be measured during the trial.
  • A description of the type/design of trial to be conducted (e.g. double-blind, placebo-controlled, parallel design) and a schematic diagram of trial design, procedures and stages.
  • A description of the measures taken to minimize/avoid bias, including:
      (a) Randomization.
      (b) Blinding.

  • A description of the trial treatment(s) and the dosage and dosage regimen of the investigational product(s). Also include a description of the dosage form, packaging, and labelling of the investigational product(s).
  • The expected duration of subject participation, and a description of the sequence and duration of all trial periods, including follow-up, if any.
  • A description of the "stopping rules" or "discontinuation criteria" for individual subjects, parts of the trial, and the entire trial.
  • Accountability procedures for the investigational product(s), including the placebo(s) and comparator(s), if any.
  • Maintenance of trial treatment randomization codes and procedures for breaking codes.
  • The identification of any data to be recorded directly on the CRFs (i.e. no prior written or electronic record of data), and to be considered to be source data.
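The randomization, blinding, and code-breaking items in the list above can be illustrated with a short Python sketch. It assumes permuted-block randomization, which is one common scheme and is not prescribed by the list itself; the kit-code format, function names, and audit log are hypothetical.

```python
import random

def permuted_block_schedule(n_subjects, arms=("A", "B"), block_size=4, seed=0):
    """Build a randomization list from shuffled blocks that are balanced by arm."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_subjects]

def blinded_code_list(schedule):
    """Pair each allocation with an uninformative kit code.

    The full mapping would be held by an unblinded party; blinded staff
    would see only the kit codes.
    """
    return {f"KIT-{i:04d}": arm for i, arm in enumerate(schedule, start=1)}

def break_code(code_list, kit_code, reason, log):
    """Unblind a single subject and record which code was broken and why."""
    log.append((kit_code, reason))
    return code_list[kit_code]

schedule = permuted_block_schedule(12, seed=42)
codes = blinded_code_list(schedule)
audit_log = []
arm = break_code(codes, "KIT-0003", "serious adverse event", audit_log)
```

Blocking keeps the arms balanced after every complete block, and logging each code break gives the accountability that the protocol is expected to describe.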


Selection and Withdrawal of Subjects:
  • Subject inclusion criteria.
  • Subject exclusion criteria.
  • Subject withdrawal criteria (i.e. terminating investigational product treatment/trial treatment) and procedures specifying:

  1. When and how to withdraw subjects from the trial/ investigational product treatment.
  2. The type and timing of the data to be collected for withdrawn subjects.
  3. Whether and how subjects are to be replaced.
  4. The follow-up for subjects withdrawn from investigational product treatment/trial treatment.



Treatment of Subjects:
  • The treatment(s) to be administered, including the name(s) of all the product(s), the dose(s), the dosing schedule(s), the route/mode(s) of administration, and the treatment period(s), including the follow-up period(s) for subjects for each investigational product treatment/trial treatment group/arm of the trial.
  • Medication(s)/treatment(s) permitted (including rescue medication) and not permitted before and/or during the trial.
  • Procedures for monitoring subject compliance. 


Assessment of Efficacy:
  • Specification of the efficacy parameters.
  • Methods and timing for assessing, recording, and analyzing efficacy parameters.


Assessment of Safety:
  • Specification of safety parameters.
  • The methods and timing for assessing, recording, and analyzing safety parameters.
  • Procedures for eliciting reports of, and for recording and reporting, adverse events and intercurrent illnesses.
  • The type and duration of the follow-up of subjects after adverse events. 


Statistics:
  • A description of the statistical methods to be employed, including timing of any planned interim analysis(ses).
  • The number of subjects planned to be enrolled. In multicentre trials, the numbers of enrolled subjects projected for each trial site should be specified. Reason for choice of sample size, including reflections on (or calculations of) the power of the trial and clinical justification.
  • The level of significance to be used.
  • Criteria for the termination of the trial.
  • Procedure for accounting for missing, unused, and spurious data.
  • Procedures for reporting any deviation(s) from the original statistical plan (any deviation(s) from the original statistical plan should be described and justified in the protocol and/or in the final report, as appropriate).
  • The selection of subjects to be included in the analyses (e.g. all randomized subjects, all dosed subjects, all eligible subjects, evaluable subjects).
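The sample size, power, and significance level items above are linked by a calculation. As a hedged illustration, the sketch below uses the standard normal-approximation formula for comparing two means with equal allocation, n per group = 2 (z_{1-α/2} + z_{power})² σ² / δ². This formula and the example numbers are assumptions for illustration; a real protocol would justify its own method and inputs.

```python
import math
from statistics import NormalDist

def n_per_group_means(delta, sigma, alpha=0.05, power=0.80):
    """Subjects per arm for a two-sided, two-sample comparison of means.

    delta: smallest clinically relevant difference between arms
    sigma: assumed common standard deviation of the outcome
    Uses the normal approximation (no t-distribution correction).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

# Illustrative inputs: detect a 5-unit difference when the SD is 10 units.
n_80 = n_per_group_means(delta=5, sigma=10)              # 80% power
n_90 = n_per_group_means(delta=5, sigma=10, power=0.90)  # 90% power
```

With these inputs the function returns 63 subjects per group at 80% power and 85 at 90% power, which shows how strongly the chosen power drives enrolment.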


Direct Access to Source Data/Documents:

  • The sponsor should ensure that it is specified in the protocol or other written agreement that the investigator(s)/institution(s) will permit trial-related monitoring, audits, IRB/IEC review, and regulatory inspection(s), providing direct access to source data/documents.



Quality Control and Quality Assurance:

Ethics:
  • Description of ethical considerations relating to the trial.

Data Handling and Record Keeping:

Financing and Insurance:
  • Financing and insurance if not addressed in a separate agreement. 


Publication Policy:
  • Publication policy, if not addressed in a separate agreement. 

Supplements:
(NOTE: Since the protocol and the clinical trial/study report are closely related, further relevant information can be found in the ICH Guideline for Structure and Content of Clinical Study Reports.)
When considering the above items, special attention must be given to designing a protocol that eliminates bias and reduces variance.








Key components of a trial protocol


The trial protocol is a comprehensive document and the core structure of the protocol should be adapted according to the type of trial. ICH–GCP can be used as a reference document when developing a protocol for pharmaceutical clinical trials (Phase I to Phase IV) involving a pharmaceutical substance (the investigational medicinal product [IMP]). Most institutions and pharmaceutical companies use a standard set of rules to define the main protocol outline, structure, format, and naming/numbering methods for their trials. In this section, we briefly describe the main components of a typical protocol.


Protocol information page:

The front page gives the:

  • trial title
  • trial identification number
  • protocol version number
  • date prepared

The descriptive title of the protocol should be kept as short as possible, but at the same time it should reflect the design, type of population, and aim of the trial. ICH–GCP suggests that the title of a pharmaceutical trial should additionally include the medicinal product(s), the nature of the treatment (e.g., treatment, prophylaxis, diagnosis, radiosensitizer), any comparator(s) and/or placebo(s), indication, and setting (outpatient or inpatient). The key investigational site, investigator, and sponsor should also be detailed on the title page.



Trial summary or synopsis:
A synopsis should provide the key aspects of the protocol in no more than two pages, and can be prepared in a table format. The main components of the protocol summary include:
  • full title
  • principal investigator
  • planned study dates
  • objectives
  • study design
  • study population
  • treatments
  • procedures
  • sample size
  • outcome measures
  • statistical methods

Friday, 14 June 2013

Docking (molecular)

In the field of molecular modeling, docking is a method which predicts the preferred orientation of one molecule to a second when bound to each other to form a stable complex.[1] Knowledge of the preferred orientation may in turn be used to predict the strength of association or binding affinity between two molecules using, for example, scoring functions.
The associations between biologically relevant molecules such as proteins, nucleic acids, carbohydrates, and lipids play a central role in signal transduction. Furthermore, the relative orientation of the two interacting partners may affect the type of signal produced (e.g., agonism vs antagonism). Therefore docking is useful for predicting both the strength and type of signal produced.
Docking is frequently used to predict the binding orientation of small molecule drug candidates to their protein targets in order to in turn predict the affinity and activity of the small molecule. Hence docking plays an important role in the rational design of drugs.[2] Given the biological and pharmaceutical significance of molecular docking, considerable efforts have been directed towards improving the methods used to predict docking.


Definition of problem

Molecular docking can be thought of as a problem of “lock-and-key”, where one is interested in finding the correct relative orientation of the “key” which will open up the “lock” (where on the surface of the lock the keyhole is, which direction to turn the key after it is inserted, etc.). Here, the protein can be thought of as the “lock” and the ligand can be thought of as the “key”. Molecular docking may be defined as an optimization problem, which would describe the “best-fit” orientation of a ligand that binds to a particular protein of interest. However, since both the ligand and the protein are flexible, a “hand-in-glove” analogy is more appropriate than “lock-and-key”.[3] During the course of the process, the ligand and the protein adjust their conformation to achieve an overall “best-fit”, and this kind of conformational adjustment resulting in the overall binding is referred to as “induced-fit”.[4]
The focus of molecular docking is to computationally simulate the molecular recognition process. The aim of molecular docking is to achieve an optimized conformation for both the protein and ligand, and a relative orientation between protein and ligand, such that the free energy of the overall system is minimized.

Docking approaches

Two approaches are particularly popular within the molecular docking community. One approach uses a matching technique that describes the protein and the ligand as complementary surfaces.[5][6][7] The second approach simulates the actual docking process in which the ligand-protein pairwise interaction energies are calculated.[8] Both approaches have significant advantages as well as some limitations. These are outlined below.

Shape complementarity

Geometric matching (shape complementarity) methods describe the protein and ligand as a set of features that make them dockable.[9] These features may include molecular surface or complementary surface descriptors. In this case, the receptor’s molecular surface is described in terms of its solvent-accessible surface area and the ligand’s molecular surface is described in terms of its matching surface description. The complementarity between the two surfaces amounts to a shape-matching description that may help in finding the complementary pose for docking the target and the ligand molecules. Another approach is to describe the hydrophobic features of the protein using turns in the main-chain atoms. Yet another approach is to use a Fourier shape descriptor technique.[10][11][12] Shape complementarity based approaches are typically fast and robust, but they usually cannot model the movements or dynamic changes in the ligand or protein conformations accurately, although recent developments allow these methods to investigate ligand flexibility. Shape complementarity methods can quickly scan through several thousand ligands in a matter of seconds to assess whether they can bind at the protein’s active site, and they usually scale even to protein-protein interactions. They are also much more amenable to pharmacophore based approaches, since they use geometric descriptions of the ligands to find optimal binding.

Simulation

Simulating the docking process itself is much more complicated. In this approach, the protein and the ligand are separated by some physical distance, and the ligand finds its position in the protein’s active site after a certain number of “moves” in its conformational space. The moves incorporate rigid body transformations such as translations and rotations, as well as internal changes to the ligand’s structure, including torsion angle rotations. Each of these moves in the conformational space of the ligand carries an energetic cost for the system, and hence after every move the total energy of the system is calculated. The obvious advantage of the method is that ligand flexibility is more easily incorporated into the modeling, whereas shape complementarity techniques have to use ingenious methods to incorporate flexibility in ligands. Another advantage is that the process is physically closer to what happens in reality, when the protein and ligand approach each other after molecular recognition. A clear disadvantage of this technique is that it takes longer to evaluate the optimal binding pose, since it has to explore a rather large energy landscape. However, grid-based techniques and fast optimization methods have significantly ameliorated these problems.
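The move-then-evaluate loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: it works in two dimensions, treats the ligand as a rigid body (no torsion moves), uses a made-up 12-6 pairwise energy in arbitrary units, and accepts moves with a Metropolis Monte Carlo criterion, which is one common choice and not necessarily what any particular docking program does. All names and coordinates are hypothetical.

```python
import math
import random

def transform(points, dx, dy, angle):
    """Rigid-body move: rotate about the centroid, then translate."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    c, s = math.cos(angle), math.sin(angle)
    return [(cx + c * (x - cx) - s * (y - cy) + dx,
             cy + s * (x - cx) + c * (y - cy) + dy) for x, y in points]

def energy(ligand, protein):
    """Toy pairwise 12-6 energy with its minimum at distance 1 (not a real force field)."""
    total = 0.0
    for lx, ly in ligand:
        for px, py in protein:
            r2 = max((lx - px) ** 2 + (ly - py) ** 2, 1e-6)
            inv6 = 1.0 / r2 ** 3
            total += inv6 * inv6 - 2.0 * inv6
    return total

def dock(ligand, protein, steps=2000, temperature=0.5, seed=0):
    """Random rigid-body moves; the energy is recalculated after every move."""
    rng = random.Random(seed)
    current, e_current = ligand, energy(ligand, protein)
    best, e_best = current, e_current
    for _ in range(steps):
        trial = transform(current,
                          rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3),
                          rng.uniform(-0.2, 0.2))
        e_trial = energy(trial, protein)
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if e_trial <= e_current or rng.random() < math.exp((e_current - e_trial) / temperature):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best

# Hypothetical "pocket" made of two rows of protein atoms, and a two-atom ligand.
protein = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 2.0), (1.0, 2.0), (2.0, 2.0)]
ligand = [(5.0, 1.0), (6.0, 1.0)]
e_start = energy(ligand, protein)
best_pose, e_best = dock(ligand, protein)
```

Every one of the 2000 steps recomputes all pairwise terms, which is the cost the paragraph above refers to; grid-based methods reduce it by precomputing the protein's contribution on a lattice.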

Mechanics of docking

To perform a docking screen, the first requirement is a structure of the protein of interest. Usually the structure has been determined using a biophysical technique such as x-ray crystallography, or less often, NMR spectroscopy. This protein structure and a database of potential ligands serve as inputs to a docking program. The success of a docking program depends on two components: the search algorithm and the scoring function.

Search algorithm

The search space in theory consists of all possible orientations and conformations of the protein paired with the ligand. In practice, however, it is impossible to exhaustively explore the search space with current computational resources: this would involve enumerating all possible distortions of each molecule (molecules are dynamic and exist in an ensemble of conformational states) and all possible rotational and translational orientations of the ligand relative to the protein at a given level of granularity. Most docking programs in use account for a flexible ligand, and several attempt to model a flexible protein receptor. Each "snapshot" of the pair is referred to as a pose.
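A rough count shows why exhaustive enumeration is impractical. The Python sketch below is back-of-the-envelope arithmetic under assumed, illustrative settings (box size, grid spacing, angular step, number of rotatable bonds); it keeps the protein rigid and counts orientations crudely as three independent angles, so it only indicates the order of magnitude.

```python
def pose_count(box_edge, grid_step, angle_step_deg, n_torsions):
    """Approximate number of ligand poses on a regular grid (rigid protein)."""
    translations = round(box_edge / grid_step) ** 3
    rotations = round(360 / angle_step_deg) ** 3       # crude: three angles
    torsions = round(360 / angle_step_deg) ** n_torsions
    return translations * rotations * torsions

# Assumed settings: 10 Å box, 0.5 Å grid, 10° angular step, 6 rotatable bonds.
n_poses = pose_count(box_edge=10, grid_step=0.5, angle_step_deg=10, n_torsions=6)
```

With these settings the count is about 8 × 10^17 poses for a single ligand, before any protein flexibility is considered.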
A variety of conformational search strategies have been applied to the ligand and to the receptor. These include:

Ligand flexibility

Conformations of the ligand may be generated in the absence of the receptor and subsequently docked,[13] or conformations may be generated on-the-fly in the presence of the receptor binding cavity,[14] or with full rotational flexibility of every dihedral angle using fragment based docking.[15] Force field energy evaluations are most often used to select energetically reasonable conformations,[16] but knowledge-based methods have also been used.[17]

Receptor flexibility

Computational capacity has increased dramatically over the last decade, making possible the use of more sophisticated and computationally intensive methods in computer-assisted drug design. However, dealing with receptor flexibility in docking methodologies is still a thorny issue. The main reason behind this difficulty is the large number of degrees of freedom that have to be considered in these kinds of calculations. Neglecting receptor flexibility, however, leads to poor docking results in terms of binding pose prediction.[18]
Multiple static structures experimentally determined for the same protein in different conformations are often used to emulate receptor flexibility.[19] Alternatively, rotamer libraries of amino acid side chains that surround the binding cavity may be searched to generate alternate but energetically reasonable protein conformations.[20][21]

Scoring function

The scoring function takes a pose as input and returns a number indicating the likelihood that the pose represents a favorable binding interaction.
Most scoring functions are physics-based molecular mechanics force fields that estimate the energy of the pose; a low (negative) energy indicates a stable system and thus a likely binding interaction. An alternative approach is to derive a statistical potential for interactions from a large database of protein-ligand complexes, such as the Protein Data Bank, and evaluate the fit of the pose according to this inferred potential.
There are a large number of structures from X-ray crystallography for complexes between proteins and high affinity ligands, but comparatively fewer for low affinity ligands, as the latter complexes tend to be less stable and therefore more difficult to crystallize. Scoring functions trained with this data can dock high affinity ligands correctly, but they will also give plausible docked conformations for ligands that do not bind. This gives a large number of false positive hits, i.e., ligands predicted to bind to the protein that actually do not when placed together in a test tube.
One way to reduce the number of false positives is to recalculate the energy of the top scoring poses using (potentially) more accurate but computationally more intensive techniques such as Generalized Born or Poisson-Boltzmann methods.[8]
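The two-stage idea in the last paragraph (score everything cheaply, then rescore only the best poses with a costlier method) can be sketched as follows. The two scoring functions and all numbers are invented stand-ins for illustration; they are not Generalized Born or Poisson-Boltzmann calculations.

```python
def rescore_top(poses, fast_score, slow_score, keep=3):
    """Rank all poses with the fast score, then re-rank the shortlist with the slow one.

    Lower scores are treated as better (more negative energy).
    """
    shortlist = sorted(poses, key=fast_score)[:keep]
    return sorted(shortlist, key=slow_score)

# Hypothetical poses: a cheap contact term and a solvation penalty that only
# the more expensive method is assumed to capture. Values are made up.
poses = [
    {"name": "a", "contact": -9.0, "desolvation": 4.0},
    {"name": "b", "contact": -8.5, "desolvation": 1.0},
    {"name": "c", "contact": -7.0, "desolvation": 0.5},
    {"name": "d", "contact": -3.0, "desolvation": 0.0},
    {"name": "e", "contact": -2.0, "desolvation": 0.2},
]

def fast(pose):
    return pose["contact"]

def slow(pose):
    return pose["contact"] + pose["desolvation"]

reranked = [p["name"] for p in rescore_top(poses, fast, slow)]
```

In this toy data, pose "a" leads the fast ranking but drops to third after rescoring, which is the kind of false positive the second stage is meant to catch.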

Applications

A binding interaction between a small molecule ligand and an enzyme protein may result in activation or inhibition of the enzyme. If the protein is a receptor, ligand binding may result in agonism or antagonism. Docking is most commonly used in the field of drug design — most drugs are small organic molecules, and docking may be applied to:
  • hit identification – docking combined with a scoring function can be used to quickly screen large databases of potential drugs in silico to identify molecules that are likely to bind to the protein target of interest (see virtual screening).
  • lead optimization – docking can be used to predict where and in which relative orientation a ligand binds to a protein (also referred to as the binding mode or pose). This information may in turn be used to design more potent and selective analogs.
  • bioremediation – protein-ligand docking can also be used to predict pollutants that can be degraded by enzymes.[22]
Tutorials illustrating the use of AutoDock and AutoDock Vina, prepared by Prof. Rino Ragno at Sapienza University, are downloadable from www.rcmd.it

Chemometrics

Chemometrics is the science of extracting information from chemical systems by data-driven means. It is a highly interfacial discipline, using methods frequently employed in core data-analytic disciplines such as multivariate statistics, applied mathematics, and computer science, in order to address problems in chemistry, biochemistry, medicine, biology and chemical engineering. In this way, it mirrors several other interfacial ‘-metrics’ such as psychometrics and econometrics.

Introduction

Chemometrics is applied to solve both descriptive and predictive problems in experimental life sciences, especially in chemistry. In descriptive applications, properties of chemical systems are modeled with the intent of learning the underlying relationships and structure of the system (i.e., model understanding and identification). In predictive applications, properties of chemical systems are modeled with the intent of predicting new properties or behavior of interest. In both cases, the datasets can be small but are often very large and highly complex, involving hundreds to thousands of variables, and hundreds to thousands of cases or observations.
Chemometric techniques are particularly heavily used in analytical chemistry and metabolomics, and the development of improved chemometric methods of analysis also continues to advance the state of the art in analytical instrumentation and methodology. It is an application-driven discipline, and thus while the standard chemometric methodologies are very widely used industrially, academic groups are dedicated to the continued development of chemometric theory, methods, and applications.

Origins

Although one could argue that even the earliest analytical experiments in chemistry involved a form of chemometrics, the field is generally recognized to have emerged in the 1970s as computers became increasingly exploited for scientific investigation. The term ‘chemometrics’ was coined by Svante Wold in a grant application in 1971,[1] and the International Chemometrics Society was formed shortly thereafter by Svante Wold and Bruce Kowalski, two pioneers in the field. Wold was a professor of organic chemistry at Umeå University, Sweden, and Kowalski was a professor of analytical chemistry at the University of Washington, Seattle.
Many early applications involved multivariate classification; numerous quantitative predictive applications followed, and by the late 1970s and early 1980s a wide variety of data- and computer-driven chemical analyses were occurring.
Multivariate analysis was a critical facet even in the earliest applications of chemometrics. The data resulting from infrared and UV/visible spectroscopy often number in the thousands of measurements per sample. Mass spectrometry, nuclear magnetic resonance, atomic emission/absorption and chromatography experiments are also all by nature highly multivariate. The structure of these data was found to be conducive to using techniques such as principal components analysis (PCA) and partial least squares (PLS). This is primarily because, while the datasets may be highly multivariate, there is strong and often linear low-rank structure present. PCA and PLS have proven over time to be very effective at empirically modeling the more chemically interesting low-rank structure, exploiting the interrelationships or ‘latent variables’ in the data, and providing alternative compact coordinate systems for further numerical analysis such as regression, clustering, and pattern recognition. Partial least squares in particular was heavily used in chemometric applications for many years before it began to find regular use in other fields.
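The low-rank structure described above can be illustrated with a minimal sketch. It assumes made-up spectra generated from a single hypothetical pure-component spectrum plus a small deterministic perturbation, and it extracts only the first principal component by power iteration on the covariance matrix. The function name and all data values are illustrative assumptions, and practical work would use an established numerical library instead.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loading vector, explained-variance fraction) of the first PC."""
    n, p = len(rows), len(rows[0])
    # Mean-center each variable (column).
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    # Power iteration converges to the dominant eigenvector of cov.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the matching eigenvalue (variance along v).
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    total_variance = sum(cov[a][a] for a in range(p))
    return v, eigenvalue / total_variance

# Hypothetical data: five samples, five spectral channels, one component.
pure_spectrum = [0.1, 0.5, 1.0, 0.5, 0.1]
concentrations = [1.0, 2.0, 3.0, 4.0, 5.0]
spectra = [
    [c * s + 0.001 * ((7 * i + 3 * j) % 5 - 2) for j, s in enumerate(pure_spectrum)]
    for i, c in enumerate(concentrations)
]

loading, explained = first_principal_component(spectra)
print("explained variance fraction:", explained)
```

Because the simulated spectra are almost exactly rank one, a single latent variable captures nearly all of the variance, and the loading vector has the shape of the pure spectrum.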
Through the 1980s three dedicated journals appeared in the field: Journal of Chemometrics, Chemometrics and Intelligent Laboratory Systems, and Journal of Chemical Information and Modeling. These journals continue to cover both fundamental and methodological research in chemometrics. At present, routine applications of existing chemometric methods are most commonly published in application-oriented journals (e.g., Applied Spectroscopy, Analytical Chemistry, Anal. Chim. Acta, Talanta). Several important books/monographs on chemometrics were also first published in the 1980s, including the first edition of Malinowski’s "Factor Analysis in Chemistry",[2] Sharaf, Illman and Kowalski’s "Chemometrics",[3] Massart et al.’s "Chemometrics: a textbook",[4] and "Multivariate Calibration" by Martens and Naes.[5]
Some large chemometric application areas have gone on to represent new domains, such as molecular modeling and QSAR, cheminformatics, the ‘-omics’ fields of genomics, proteomics, metabonomics and metabolomics, process modeling and process analytical technology.
An account of the early history of chemometrics was published as a series of interviews by Geladi and Esbensen.[6][7]

Techniques

Multivariate calibration

Many chemical problems and applications of chemometrics involve calibration. The objective is to develop models which can be used to predict properties of interest based on measured properties of the chemical system, such as pressure, flow, temperature, infrared, Raman, NMR spectra and mass spectra. Examples include the development of multivariate models relating 1) multi-wavelength spectral response to analyte concentration, 2) molecular descriptors to biological activity, 3) multivariate process conditions/states to final product attributes. The process requires a calibration or training data set, which includes reference values for the properties of interest for prediction, and the measured attributes believed to correspond to these properties. For case 1), for example, one can assemble data from a number of samples, including concentrations for an analyte of interest for each sample (the reference) and the corresponding infrared spectrum of that sample. Multivariate calibration techniques such as partial least-squares regression or principal component regression (and near countless other methods) are then used to construct a mathematical model that relates the multivariate response (spectrum) to the concentration of the analyte of interest, and such a model can be used to efficiently predict the concentrations of new samples.
Techniques in multivariate calibration are often broadly categorized as classical or inverse methods.[5][8] The principal difference between these approaches is that in classical calibration the models are solved such that they are optimal in describing the measured analytical responses (e.g., spectra) and can therefore be considered optimal descriptors, whereas in inverse methods the models are solved to be optimal in predicting the properties of interest (e.g., concentrations) and can therefore be considered optimal predictors.[9] Inverse methods usually require less physical knowledge of the chemical system, and at least in theory provide superior predictions in the mean-squared error sense,[10][11][12] and hence inverse approaches tend to be more frequently applied in contemporary multivariate calibration.
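The distinction between classical and inverse calibration can be sketched in the simplest possible setting. The example below is deliberately reduced to one analyte and one measured response (a univariate simplification of the multivariate case discussed here), with made-up absorbance values. The helper names are illustrative assumptions, not part of any particular chemometrics package.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical calibration set: reference concentrations and measured responses.
concentration = [1.0, 2.0, 3.0, 4.0, 5.0]
absorbance = [0.21, 0.39, 0.62, 0.79, 1.01]

# Classical calibration: model the response as a function of concentration,
# then invert the fitted model to estimate an unknown concentration.
sensitivity, offset = fit_line(concentration, absorbance)

def predict_classical(response):
    return (response - offset) / sensitivity

# Inverse calibration: model the concentration directly as a function of the
# measured response, so the fit minimizes error in the predicted property.
coefficient, intercept = fit_line(absorbance, concentration)

def predict_inverse(response):
    return coefficient * response + intercept

new_response = 0.50
print("classical:", predict_classical(new_response))
print("inverse:  ", predict_inverse(new_response))
```

With low-noise data the two estimates are nearly identical; they diverge as measurement noise grows, which is where the choice between the approaches starts to matter.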
The main advantage of multivariate calibration techniques is that fast, cheap, or non-destructive analytical measurements (such as optical spectroscopy) can be used to estimate sample properties which would otherwise require time-consuming, expensive or destructive testing (such as HPLC-MS). Equally important is that multivariate calibration allows for accurate quantitative analysis in the presence of heavy interference by other analytes. The selectivity of the analytical method is provided as much by the mathematical calibration as by the analytical measurement modalities. For example, near-infrared spectra, which are extremely broad and non-selective compared to other analytical techniques (such as infrared or Raman spectra), can often be used successfully in conjunction with carefully developed multivariate calibration methods to predict concentrations of analytes in very complex matrices.

Classification, pattern recognition, clustering

Supervised multivariate classification techniques are closely related to multivariate calibration techniques in that a calibration or training set is used to develop a mathematical model capable of classifying future samples. The techniques employed in chemometrics are similar to those used in other fields – multivariate discriminant analysis, logistic regression, neural networks, regression/classification trees. The use of rank reduction techniques in conjunction with these conventional classification methods is routine in chemometrics, for example discriminant analysis on principal components or partial least squares scores.
Unsupervised classification (also termed cluster analysis) is also commonly used to discover patterns in complex data sets, and again many of the core techniques used in chemometrics are common to other fields such as machine learning and statistical learning.

Multivariate curve resolution

In chemometric parlance, multivariate curve resolution seeks to deconstruct data sets with limited or absent reference information and system knowledge. Some of the earliest work on these techniques was done by Lawton and Sylvestre in the early 1970s.[13][14] These approaches are also called self-modeling mixture analysis, blind source/signal separation, and spectral unmixing. For example, from a data set comprising fluorescence spectra from a series of samples each containing multiple fluorophores, multivariate curve resolution methods can be used to extract the fluorescence spectra of the individual fluorophores, along with their relative concentrations in each of the samples, essentially unmixing the total fluorescence spectrum into the contributions from the individual components. The problem is usually ill-determined due to rotational ambiguity (many possible solutions can equivalently represent the measured data), so the application of additional constraints is common, such as non-negativity, unimodality, or known interrelationships between the individual components (e.g., kinetic or mass-balance constraints).[15][16]
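Rotational ambiguity can be made concrete with a small numerical sketch. It is not a curve-resolution algorithm; it only shows, with made-up concentration and spectral profiles, that two different factorizations reproduce the same data matrix and that a non-negativity constraint rejects one of them. The matrices and the mixing matrix T are illustrative assumptions.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical "true" factors: 3 samples x 2 components, 2 components x 4 channels.
C = [[1.0, 0.2],
     [0.5, 0.8],
     [0.1, 1.0]]
S = [[1.0, 0.5, 0.0, 0.0],
     [0.0, 0.2, 1.0, 0.4]]
D = matmul(C, S)  # the measured data matrix

# Any invertible T yields another exact factorization: D = (C T)(T^-1 S).
T = [[1.0, 0.5],
     [0.0, 1.0]]
T_inv = [[1.0, -0.5],
         [0.0, 1.0]]
C_alt = matmul(C, T)
S_alt = matmul(T_inv, S)
D_alt = matmul(C_alt, S_alt)

same_data = all(abs(D[i][j] - D_alt[i][j]) < 1e-12
                for i in range(len(D)) for j in range(len(D[0])))
alt_is_nonnegative = all(value >= 0 for row in S_alt for value in row)

print("alternative factors reproduce D:", same_data)
print("alternative spectra non-negative:", alt_is_nonnegative)
```

Both factor pairs fit the data equally well, but the alternative spectra contain negative intensities, so a non-negativity constraint excludes that solution.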

Other techniques

Experimental design remains a core area of study in chemometrics and several monographs are specifically devoted to experimental design in chemical applications.[17][18] Sound principles of experimental design have been widely adopted within the chemometrics community, although many complex experiments are purely observational, and there can be little control over the properties and interrelationships of the samples and sample properties.
Signal processing is also a critical component of almost all chemometric applications, particularly the use of signal pretreatments to condition data prior to calibration or classification. The techniques employed commonly in chemometrics are often closely related to those used in related fields.[19]
Performance characterization and figures of merit: Like most arenas in the physical sciences, chemometrics is quantitatively oriented, so considerable emphasis is placed on performance characterization, model selection, verification and validation, and figures of merit. The performance of quantitative models is usually specified by the root mean squared error in predicting the attribute of interest, and the performance of classifiers as true-positive rate/false-positive rate pairs (or a full ROC curve). A recent report by Olivieri et al. provides a comprehensive overview of figures of merit and uncertainty estimation in multivariate calibration, including multivariate definitions of selectivity, sensitivity, SNR and prediction interval estimation.[20] Chemometric model selection usually involves the use of resampling tools (including bootstrap, permutation and cross-validation).
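The figures of merit named above are simple to compute. The following minimal sketch assumes binary class labels coded as 1 (positive) and 0 (negative), and made-up values throughout. The fold-splitting helper is one simple interleaved scheme among many and is an illustrative assumption, not a prescribed procedure.

```python
import math

def rmse(reference, predicted):
    """Root mean squared error between reference and predicted values."""
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def tpr_fpr(true_labels, predicted_labels):
    """True-positive rate and false-positive rate for 0/1 labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for an interleaved k-fold split."""
    for fold in range(k):
        test = list(range(fold, n_samples, k))
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test

error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
rates = tpr_fpr([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
folds = list(k_fold_indices(6, 3))
print(error, rates, folds)
```

In a cross-validation loop, a model would be fitted on each train index list and rmse (or the rate pair) evaluated on the corresponding test indices.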
Multivariate statistical process control (MSPC), modeling and optimization accounts for a substantial amount of historical chemometric development.[21][22][23] Spectroscopy has been used successfully for online monitoring of manufacturing processes for 30–40 years, and this process data is highly amenable to chemometric modeling. Specifically in terms of MSPC, multiway modeling of batch and continuous processes is increasingly common in industry and remains an active area of research in chemometrics and chemical engineering. Process analytical chemistry, as it was originally termed,[24] or process analytical technology, to use the newer term, continues to draw heavily on chemometric methods and MSPC.
Multiway methods are heavily used in chemometric applications.[25][26] These are higher-order extensions of more widely used methods. For example, while the analysis of a table (matrix, or second-order array) of data is routine in several fields, multiway methods are applied to data sets that involve third-, fourth-, or higher-order arrays. Data of this type is very common in chemistry; for example, a liquid chromatography/mass spectrometry (LC-MS) system generates a large matrix of data (elution time versus m/z) for each sample analyzed. The data across multiple samples thus comprise a data cube. Batch process modeling involves data sets that have time vs. process variables vs. batch number. The multiway mathematical methods applied to these sorts of problems include PARAFAC, trilinear decomposition, and multiway PLS and PCA.
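The data-cube layout can be sketched with nested lists. The dimensions and intensity values below are made up, and the unfolding step (flattening each sample's matrix into one row so that two-way methods can be applied) is shown as one common way of handling such arrays, not as the procedure used by any specific method named above.

```python
n_samples, n_times, n_mz = 3, 4, 5

# cube[s][t][m]: intensity for sample s at elution-time index t and m/z index m.
cube = [[[float(s + t * m) for m in range(n_mz)]
         for t in range(n_times)]
        for s in range(n_samples)]

def unfold_by_sample(data_cube):
    """Flatten each sample's (time x m/z) matrix into a single row."""
    return [[value for time_slice in sample for value in time_slice]
            for sample in data_cube]

unfolded = unfold_by_sample(cube)
print(len(unfolded), "rows x", len(unfolded[0]), "columns")
```

Each row of the unfolded matrix holds one sample, with column index t * n_mz + m corresponding to time index t and m/z index m.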

Chemogenomics

Chemogenomics, or chemical genomics, is the systematic screening of targeted chemical libraries of small molecules against individual drug target families (e.g., GPCRs, nuclear receptors, kinases, proteases, etc.) with the ultimate goal of identifying novel drugs and drug targets.[1] Typically some members of a target library have been well characterized where both the function has been determined and compounds that modulate the function of those targets (ligands in the case of receptors, inhibitors of enzymes, or blockers of ion channels) have been identified. Other members of the target family may have unknown function with no known ligands and hence are classified as orphan receptors. By identifying screening hits that modulate the activity of the less well characterized members of the target family, the function of these novel targets can be elucidated. Furthermore, the hits for these targets can be used as a starting point for drug discovery.
A common method to construct a targeted chemical library is to include known ligands of at least one and preferably several members of the target family. Since a portion of ligands that were designed and synthesized to bind to one family member will also bind to additional family members, the compounds contained in a targeted chemical library should collectively bind to a high percentage of the target family.
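The idea of collective coverage of a target family can be expressed as a short calculation. The compound names, target names and binding assignments below are entirely hypothetical and serve only to show the bookkeeping.

```python
# Hypothetical target family and known binding data for a small library.
target_family = {"kinase_A", "kinase_B", "kinase_C", "kinase_D"}
library = {
    "compound_1": {"kinase_A", "kinase_B"},
    "compound_2": {"kinase_B"},
    "compound_3": {"kinase_C"},
}

def family_coverage(compound_targets, family):
    """Fraction of family members bound by at least one library compound."""
    covered = set().union(*compound_targets.values()) & family
    return len(covered) / len(family)

coverage = family_coverage(library, target_family)
uncovered = target_family - set().union(*library.values())
print("coverage:", coverage, "uncovered:", sorted(uncovered))
```

Targets left uncovered in such a tally are the ones for which screening hits would be most informative.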

CHEMOINFORMATICS

Cheminformatics (also known as chemoinformatics and chemical informatics) is the use of computer and informational techniques applied to a range of problems in the field of chemistry. These in silico techniques are used, for example, by pharmaceutical companies in the process of drug discovery. These methods can also be used in chemical and allied industries in various other forms.


History

The term chemoinformatics was defined by F.K. Brown [1][2] in 1998:
Chemoinformatics is the mixing of those information resources to transform data into information and information into knowledge for the intended purpose of making better decisions faster in the area of drug lead identification and optimization.
Since then, both spellings have been used; Cheminformatics has become established in some quarters,[3] while European academia settled on Chemoinformatics in 2006.[4] The recent establishment of the Journal of Cheminformatics is a strong push towards the shorter variant.

Basics

Cheminformatics combines the scientific working fields of chemistry, computer science and information science, for example in the areas of topology, chemical graph theory, information retrieval and data mining in chemical space.[5][6][7] Cheminformatics can also be applied to data analysis in industries such as paper and pulp, dyes and allied industries.

Applications

Storage and retrieval

The primary application of cheminformatics is in the storage, indexing and search of information relating to compounds. The efficient search of such stored information includes topics that are dealt with in computer science, such as data mining, information retrieval, information extraction and machine learning. Related research topics include:

File formats

The in silico representation of chemical structures uses specialized formats such as the XML-based Chemical Markup Language or SMILES. These representations are often used for storage in large chemical databases. While some formats are suited for visual representations in 2 or 3 dimensions, others are more suited for studying physical interactions, modeling and docking studies.
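As a rough illustration of how a line notation such as SMILES encodes a structure as text, the sketch below counts heavy atoms in a SMILES string. It is a simplified assumption-laden reader, not a SMILES parser: it handles only unbracketed atoms from the common organic subset (including lowercase aromatic atoms), ignores implicit hydrogens, and rejects bracket atoms such as [Na+]. The function name is hypothetical, and real work would rely on a cheminformatics toolkit.

```python
import re
from collections import Counter

# Two-letter halogens must be tried before single-letter symbols.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def count_heavy_atoms(smiles):
    """Count unbracketed organic-subset atoms in a simple SMILES string."""
    if "[" in smiles:
        raise ValueError("bracket atoms are not handled by this sketch")
    symbols = ATOM_PATTERN.findall(smiles)
    # Aromatic atoms are written in lowercase; report them by element symbol.
    return dict(Counter(s.capitalize() if s.islower() else s for s in symbols))

acetic_acid = count_heavy_atoms("CC(=O)O")
benzene = count_heavy_atoms("c1ccccc1")
dichloromethane = count_heavy_atoms("ClCCl")
print(acetic_acid, benzene, dichloromethane)
```

Bond symbols, ring-closure digits and branch parentheses are simply skipped here, which is why the sketch recovers composition but not connectivity.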

Virtual libraries

Chemical data can pertain to real or virtual molecules. Virtual libraries of compounds may be generated in various ways to explore chemical space and hypothesize novel compounds with desired properties.
Virtual libraries of classes of compounds (drugs, natural products, diversity-oriented synthetic products) were recently generated using the FOG (fragment optimized growth) algorithm.[8] This was done by using cheminformatic tools to train transition probabilities of a Markov chain on authentic classes of compounds, and then using the Markov chain to generate novel compounds that were similar to the training database.
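The general idea of training Markov-chain transition probabilities and then sampling from them can be sketched with a toy model. This is not the FOG algorithm: it is a character-level chain trained on a handful of made-up strings, with hypothetical start and end markers, and the generated strings are not guaranteed to be chemically valid.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical boundary markers, not chemical symbols

def train_transitions(sequences):
    """Estimate first-order transition probabilities from token sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in sequences:
        tokens = [START] + list(sequence) + [END]
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

def generate(transitions, rng, max_length=20):
    """Sample one sequence by walking the chain from START until END."""
    state, output = START, []
    while len(output) < max_length:
        followers = transitions[state]
        state = rng.choices(list(followers), weights=list(followers.values()))[0]
        if state == END:
            break
        output.append(state)
    return "".join(output)

training_set = ["CCO", "CCN", "CCCO", "CNC"]  # toy stand-ins for a compound class
transitions = train_transitions(training_set)
rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [generate(transitions, rng) for _ in range(5)]
print(samples)
```

Every adjacent pair of tokens in a generated string was observed in the training set, which is the sense in which the output resembles the training data.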

Virtual screening

In contrast to high-throughput screening, virtual screening involves computationally screening in silico libraries of compounds, by means of various methods such as docking, to identify members likely to possess desired properties such as biological activity against a given target. In some cases, combinatorial chemistry is used in the development of the library to increase the efficiency in mining the chemical space. More commonly, a diverse library of small molecules or natural products is screened.

Quantitative structure-activity relationship (QSAR)

This is the calculation of quantitative structure–activity relationship and quantitative structure–property relationship values, used to predict the activity of compounds from their structures. In this context there is also a strong relationship to chemometrics. Chemical expert systems are also relevant, since they represent parts of chemical knowledge as an in silico representation.