Neural Nets for Multi-channel Attribution Estimation

Multi-touch attribution (MTA) plays a crucial role in advertising and marketing, offering insight into the complex series of interactions within customer journeys during transactions or impressions. In this blog, we show several neural network methods for estimating attributions.

Yudi Zhang
Amazon Employee
Published Feb 23, 2024
Last Modified Feb 26, 2024
Authors: Yudi Zhang, Oshry Ben-Harush


Companies aim to attribute conversion credit to the different touchpoints in a customer's journey across marketing channels such as paid search and banner ads. While randomized A/B testing provides the most accurate attribution, it is often infeasible to perform at scale without impacting customer experience or requiring large budgets. Thus, companies rely on multi-touch attribution (MTA) models to estimate channel influence.
Early MTA models such as first-touch, last-touch, linear, and time decay used simplistic, rule-based assignment of credit. More advanced options have emerged, such as using cooperative game theory and the Shapley value to capture marginal touchpoint contributions. However, the factorial computational complexity of this approach remains a challenge. Markov chain models combined with a removal effect methodology incorporate state transition probabilities in customer journeys, but struggle with higher-order dependencies. Causal approaches like survival analysis focus specifically on predicting conversion events, but assumptions about survival functions limit their adaptability across diverse data. Importantly, the historical data used to train most models lacks exogenous variation. This poses a risk of bias from confounders, making accurate attribution difficult. CausalMTA aims to address this issue by eliminating confounding from both static and dynamic perspectives.
Nevertheless, these methodologies lean towards single-point prediction and overlook sequential patterns in user browsing history. Attribution credits derived from these methods often rely on heuristic additive assumptions, which may prove ineffective in practical scenarios. Additionally, assumptions about the survival function, such as an exponential hazard function or a Weibull distribution, constrain the model's capacity to adapt to diverse real-world data.
In this blog, we use recurrent neural networks (RNNs), temporal convolutional networks (TCNs), ODE-LSTM (an LSTM combined with ordinary differential equations), and Transformers to model customer journeys. The objective is to predict whether or not a journey ends in a conversion, which is essentially a binary classification problem. We then compare the attributions obtained by each method.
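For reference, the rule-based baselines mentioned above (first-touch, last-touch, linear, and time decay) can be sketched in a few lines. This is only an illustration: the function name `rule_based_credit`, the `(channel, days_before_conversion)` journey format, and the `half_life` parameter are assumptions made for this example, not part of any particular library.

```python
def rule_based_credit(journey, rule="linear", half_life=2.0):
    """Split one conversion's credit across the touchpoints of a journey.

    `journey` is a list of (channel, days_before_conversion) pairs,
    ordered from first touch to last touch (hypothetical format).
    """
    n = len(journey)
    if rule == "first_touch":
        weights = [1.0] + [0.0] * (n - 1)
    elif rule == "last_touch":
        weights = [0.0] * (n - 1) + [1.0]
    elif rule == "linear":
        weights = [1.0] * n
    elif rule == "time_decay":
        # The weight halves for every `half_life` days before the conversion.
        weights = [0.5 ** (days / half_life) for _, days in journey]
    else:
        raise ValueError(f"unknown rule: {rule}")

    total = sum(weights)
    credit = {}
    for (channel, _), w in zip(journey, weights):
        credit[channel] = credit.get(channel, 0.0) + w / total
    return credit


journey = [("banner", 4.0), ("paid_search", 2.0), ("email", 0.0)]
for rule in ("first_touch", "last_touch", "linear", "time_decay"):
    print(rule, rule_based_credit(journey, rule))
```

Each rule fixes the credit split in advance, regardless of what the data says, which is the limitation the learned models below try to remove.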


Neural ODE

Neural ODEs (https://arxiv.org/abs/1806.07366) are a class of deep learning models in which the continuous dynamics of a system are approximated using ordinary differential equations. Instead of specifying a discrete sequence of hidden layers, Neural ODEs use a neural network to learn the derivative of the hidden state, which describes how the state evolves continuously over time.
Pros: Useful for time-series data. Memory-efficient and flexible for irregularly sampled data.
Cons: Computationally intensive due to the need for solving differential equations during training. Interpretability might be challenging due to the continuous nature of the model.
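To make the idea concrete, here is a minimal sketch of how a hidden state can be evolved across the irregular time gaps between touchpoints. It uses a fixed-step Euler solver rather than the adaptive solvers typically used in practice, and `toy_dynamics` is a hypothetical stand-in for a learned network; both are simplifying assumptions for illustration.

```python
def euler_evolve(h, dt, dynamics, n_steps=10):
    """Integrate dh/dt = dynamics(h) over a gap of length dt with fixed-step Euler."""
    step = dt / n_steps
    for _ in range(n_steps):
        dh = dynamics(h)
        h = [hi + step * di for hi, di in zip(h, dh)]
    return h


def toy_dynamics(h):
    # Hypothetical stand-in for a learned network f_theta(h):
    # a simple exponential decay toward zero.
    return [-0.5 * hi for hi in h]


# Touchpoints arrive at irregular times (in days); the hidden state is
# evolved across each gap instead of assuming evenly spaced steps.
event_times = [0.0, 0.3, 2.0, 2.1]
h = [1.0, -2.0]
for prev_t, t in zip(event_times, event_times[1:]):
    h = euler_evolve(h, t - prev_t, toy_dynamics)
print(h)
```

Every gap requires several evaluations of the dynamics function, which hints at why solving the ODE during training is computationally expensive.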


Long short-term memory (LSTM) is a type of recurrent neural network (RNN) architecture designed to capture long-term dependencies in sequential data. It uses a memory cell with gating mechanisms (input, output, and forget gates) to selectively remember or forget information over time (Wikipedia).
Pros: Effective in capturing long-range dependencies in sequential data. Mitigates the vanishing gradient problem better than traditional RNNs.
Cons: Computationally more expensive compared to simpler RNN architectures.
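The gating mechanism can be illustrated with a single step of a scalar LSTM cell. This is a sketch of the standard LSTM update equations with one-dimensional inputs and states; the parameter dictionary layout and the values in `params` are assumptions chosen only to make the behavior easy to see.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; `p` maps gate name -> (w_x, w_h, bias)."""

    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))  # how much of the old memory to keep
    i = sigmoid(pre("input"))   # how much new information to write
    o = sigmoid(pre("output"))  # how much of the memory to expose
    g = math.tanh(pre("cand"))  # candidate memory content
    c = f * c_prev + i * g      # updated memory cell
    h = o * math.tanh(c)        # updated hidden state
    return h, c


# Hypothetical parameters: forget gate almost fully open, input gate almost
# closed, so the memory cell is carried over nearly unchanged.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "cand": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, p=params)
print(h, c)
```

With these settings the cell state stays close to its previous value, which is how an LSTM can preserve information across many steps.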


TCNs (https://arxiv.org/abs/1608.08242) rely on causal convolutions, dilated convolutions, and residual connections to capture temporal dependencies in sequential data efficiently.
Pros: Parallelizable operations, leading to faster training compared to RNNs. Captures both short-term and relatively long-term dependencies effectively.
Cons: May struggle with capturing extremely long-term dependencies.
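A causal, dilated convolution can be sketched as follows. This is a plain-Python illustration of a single one-dimensional filter, not a full TCN (it omits residual connections, multiple channels, and nonlinearities); the function name and example values are assumptions.

```python
def causal_dilated_conv(x, kernel, dilation=1):
    """1-D causal convolution: out[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # positions before the start act as zero padding
                acc += w * x[idx]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(causal_dilated_conv(x, kernel=[0.5, 0.5], dilation=1))
print(causal_dilated_conv(x, kernel=[0.5, 0.5], dilation=2))
```

Because each output is computed independently of the others, all time steps can be processed in parallel. Stacking layers with growing dilation widens the receptive field, but it remains finite, which is why extremely long dependencies can be missed.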


Transformers use multi-head attention, which can improve the way time series models handle long-term dependencies.
Pros: Parallel computation of attention heads enables efficient training on parallel computing hardware. Captures global dependencies across the input sequence through self-attention mechanisms.
Cons: Requires large amounts of data and computational resources for training. It may also need a temporal encoding to capture irregular time gaps.
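One possible way to give a Transformer access to irregular time gaps is to feed the actual timestamp, rather than the integer position, into a sinusoidal encoding. The sketch below is an assumption about how such a temporal encoding could look, not the encoding used in the experiments here; `time_encoding`, `d_model`, and `max_period` are illustrative names.

```python
import math


def time_encoding(t, d_model=8, max_period=10000.0):
    """Sinusoidal encoding of a continuous timestamp t (e.g. days since first touch).

    Assumes d_model is even; returns a list of length d_model.
    """
    enc = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (max_period ** (i / d_model))
        enc.append(math.sin(t * freq))
        enc.append(math.cos(t * freq))
    return enc


# Two touchpoints 0.3 days apart get closer encodings than two 5 days apart.
for t in (0.0, 0.3, 5.0):
    print(t, [round(v, 3) for v in time_encoding(t)])
```

The resulting vector would typically be added to, or concatenated with, the touchpoint embedding before the attention layers.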

Attribution Estimation

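We have two ways to obtain the attributions: applying a post-hoc explainability method to a trained model, or building an attention layer into the model itself.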

Explainable Methods

Multiple attribution methods are implemented in Captum (see https://captum.ai/tutorials/). They are easy to adapt: for example, once a model is trained, the attributions can be obtained as shown below.
from captum.attr import IntegratedGradients

# Generic helper that calls `attribute` on the given attribution algorithm.
# `target` selects the output index to explain; it can stay None when the
# model returns a single scalar per example.
def attribute_image_features(algorithm, input, target=None, **kwargs):
    input.requires_grad = True
    attributions = algorithm.attribute(input, target=target, **kwargs)

    return attributions

# Integrated gradients attribution algorithm https://arxiv.org/abs/1703.01365
ig = IntegratedGradients(model)
# Use an all-zeros baseline; `delta` is the convergence (approximation) error.
attributions, delta = attribute_image_features(ig, input, baselines=input * 0, return_convergence_delta=True)


The second way is to use an attention layer. We use X to denote the input data and h to denote the hidden states the model computes from it. The hidden states h are fed through a one-layer multilayer perceptron (MLP) to get v, where W is a learnable matrix. Then, we measure the importance of each touchpoint by assessing the similarity of v with a vector u, and obtain a normalized importance weight a through a softmax function. It is noteworthy that, by design, a > 0. This construction offers the advantage that the contribution of every touchpoint is always positive.
Afterward, we compute the vector s as the weighted sum of touchpoint representations based on these non-negative weights. Essentially, s is a convex combination of all h. The vector u can be seen as a high-level representation of a fixed sequence. We can customize this attribution model by imposing constraints on u based on domain knowledge about touchpoint importance. The vector u can either be kept fixed or be initialized randomly and learned jointly during training. In our modeling, we adopt the latter approach.
Figure: Using attention for attribution estimation.
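The attention computation described above can be sketched in plain Python. The tanh activation and the bias b in the one-layer MLP are assumptions, since the text only specifies a learnable matrix W; the dot product is used as the similarity between v and u. All names and example values are illustrative.

```python
import math


def attention_pool(H, W, b, u):
    """Return (a, s): softmax weights over touchpoints and their weighted sum.

    H: list of T hidden states, each a list of length d
    W: k x d matrix (list of rows), b: length-k bias, u: length-k context vector
    """
    # v_t = tanh(W h_t + b): one-layer MLP applied to each hidden state.
    V = [
        [math.tanh(sum(w * x for w, x in zip(row, h)) + bi) for row, bi in zip(W, b)]
        for h in H
    ]
    # Similarity of each v_t with u, then a numerically stable softmax.
    scores = [sum(vi * ui for vi, ui in zip(v, u)) for v in V]
    m = max(scores)
    exps = [math.exp(sc - m) for sc in scores]
    z = sum(exps)
    a = [e / z for e in exps]
    # s = sum_t a_t * h_t, a convex combination of the hidden states.
    d = len(H[0])
    s = [sum(a_t * h[j] for a_t, h in zip(a, H)) for j in range(d)]
    return a, s


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three touchpoints, d = 2
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
u = [1.0, 0.0]
a, s = attention_pool(H, W, b, u)
print(a, s)
```

The weights a are strictly positive and sum to one, so they can be read directly as per-touchpoint attribution shares.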
For comparison, we measure model performance along two dimensions. The first focuses on conversion estimation performance, where we use AUC and PR-AUC (the areas under the ROC and precision-recall curves). The second evaluates the quality of the calculated attributions for the various channels. For this, we introduce a metric called AURE (Area Under Removal Effects). AURE tracks the cumulative impact on conversion probabilities of successively removing channels in the order of their ranked attributions.
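The exact formula for AURE is not spelled out here, so the following is only one plausible reading of the description: remove channels one at a time in ranked order, record the drop in mean predicted conversion probability, and take the area under that curve. The functions, the toy `predict` model, and the channel effects are all hypothetical.

```python
def removal_curve(predict, journeys, ranked_channels):
    """Mean predicted conversion probability after removing the top-k channels, k = 0..K."""
    curve = []
    removed = set()
    for k in range(len(ranked_channels) + 1):
        if k > 0:
            removed.add(ranked_channels[k - 1])
        probs = [predict([c for c in j if c not in removed]) for j in journeys]
        curve.append(sum(probs) / len(probs))
    return curve


def area_under_removal_effects(curve):
    """Trapezoidal area under the drop (curve[0] - curve[k]) on a unit-width grid."""
    drops = [curve[0] - p for p in curve]
    return sum((drops[k] + drops[k + 1]) / 2.0 for k in range(len(drops) - 1))


# Hypothetical stand-in for a trained conversion model.
EFFECTS = {"search": 0.5, "email": 0.2, "banner": 0.05}


def predict(journey):
    return 0.1 + sum(EFFECTS[c] for c in set(journey))


journeys = [["banner", "search", "email"], ["email", "banner"]]
good = area_under_removal_effects(
    removal_curve(predict, journeys, ["search", "email", "banner"])
)
bad = area_under_removal_effects(
    removal_curve(predict, journeys, ["banner", "email", "search"])
)
print(good, bad)
```

Under this reading, a ranking that places the most influential channels first produces a larger area, because the predicted conversion probability falls faster as channels are removed.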


We demonstrate these comparisons on a public dataset: https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/. The ODE-LSTM and ALSTM (AttentionLSTM) methods use attention to estimate the attributions, whereas the remaining methods use Integrated Gradients.

Conversion Estimation Performance

The table below shows the performance of the different methods. Because this is a relatively easy classification task, the performance of each method is roughly the same.

Attribution Estimation

The figures below show the normalized attribution scores of ten selected channels, as well as the AURE curves used to validate the attributions. On this relatively easy task (data), all four methods appear to be similar, and ODE-LSTM and Transformers are highly consistent.
Figure: Attribution comparisons on the top 10 channels selected by the best-performing model on each dataset.
Figure: AURE on the top 100 channels selected by each method.


In practice, the performance of these various MTA models can vary significantly across different marketing datasets. For example, in another undisclosed dataset we tested, an attention-based LSTM model (ALSTM) achieved the best performance in predicting conversion events. However, an ODE-RNN extension performed the best in terms of Area Under Removal Effects (AURE). This highlights how there is no one-size-fits-all best model - the optimal approach depends on the specific predictive goal and dataset characteristics.
In terms of efficiency, standard LSTM and temporal convolutional networks are generally the fastest to train. On the other hand, incorporating continuous-time dynamics through ODEs, as in ODE-LSTM models, introduces major computational expenses that can make training extremely slow. There is an inherent trade-off between predictive accuracy and training efficiency that must be weighed given runtime constraints.
Importantly, while this work references promising recent innovations in MTA, real-world empirical testing across diverse proprietary datasets is necessary to rigorously compare performance. As the results can vary significantly, no definitive conclusions can be drawn on which methods generalize the best. There remains ample opportunity for future research around integrating different modeling components to best capture the nuances of complex sequential attribution tasks. Testing on entirely new large-scale industry datasets is key to guiding the development of the next generation of hybrid MTA approaches.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.