{
  "name": "Attentive Normalization",
  "full_name": "Attentive Normalization",
  "description": "**Attentive Normalization** generalizes the common affine transformation component in the vanilla feature normalization. Instead of learning a single affine transformation, AN learns a mixture of affine transformations and utilizes their weighted-sum as the final affine transformation applied to re-calibrate features in an instance-specific way. The weights are learned by leveraging feature attention.",
  "title": "Attentive Normalization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "IQ-Learn",
  "full_name": "Inverse Q-Learning",
  "description": "**Inverse Q-Learning (IQ-Learn)** is a a simple, stable & data-efficient framework for Imitation Learning (IL), that directly learns *soft Q-functions* from expert data. IQ-Learn enables **non-adverserial** imitation learning, working on both offline and online IL settings. It is performant even with very sparse expert data, and scales to complex image-based environments, surpassing prior methods by more than **3x**. \r\n\r\nIt is very simple to implement requiring ~15 lines of code on top of existing RL methods.\r\n\r\n<span class=\"description-source\">Source: [IQ-Learn: Inverse soft Q-Learning for Imitation](https://arxiv.org/abs/2106.12142)</span>",
  "title": null,
  "collection": "Imitation Learning Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "OverFeat",
  "full_name": "OverFeat",
  "description": "**OverFeat** is a classic type of convolutional neural network architecture, employing [convolution](https://paperswithcode.com/method/convolution), pooling and fully connected layers. The Figure to the right shows the architectural details.",
  "title": "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Subformer",
  "full_name": "Subformer",
  "description": "**Subformer** is a [Transformer](https://paperswithcode.com/method/transformer) that combines sandwich-style parameter sharing, which overcomes naive cross-layer parameter sharing in generative models, and self-attentive embedding factorization (SAFE). In SAFE, a small self-attention layer is used to reduce embedding parameter count.",
  "title": "Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "GeGLU",
  "full_name": "GeGLU",
  "description": "**GeGLU** is an activation function which is a variant of [GLU](https://paperswithcode.com/method/glu). The definition is as follows:\r\n\r\n$$ \\text{GeGLU}\\left(x, W, V, b, c\\right) = \\text{GELU}\\left(xW + b\\right) \\otimes \\left(xV + c\\right) $$",
  "title": "GLU Variants Improve Transformer",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "FSAF",
  "full_name": "FSAF",
  "description": "**FSAF**, or Feature Selective Anchor-Free, is a building block for single-shot object detectors. It can be plugged into single-shot detectors with feature pyramid structure. The FSAF module addresses two limitations brought up by the conventional anchor-based detection: 1) heuristic-guided feature selection; 2) overlap-based anchor sampling. The general concept of the FSAF module is online feature selection applied to the training of multi-level anchor-free branches. Specifically, an anchor-free branch is attached to each level of the feature pyramid, allowing box encoding and decoding in the anchor-free manner at an arbitrary level. During training, we dynamically assign each instance to the most suitable feature level. At the time of inference, the FSAF module can work jointly with anchor-based branches by outputting predictions in parallel. We instantiate this concept with simple implementations of anchor-free branches and online feature selection strategy\r\n\r\nThe general concept is presented in the Figure to the right. An anchor-free branch is built per level of feature pyramid, independent to the anchor-based branch. Similar to the anchor-based branch, it consists of a classification subnet and a regression subnet (not shown in figure). An instance can be assigned to arbitrary level of the anchor-free branch. During training, we dynamically select the most suitable level of feature for each instance based on the instance content instead of just the size of instance box. The selected level of feature then learns to detect the assigned instances. At inference, the FSAF module can run independently or jointly with anchor-based branches. The FSAF module is agnostic to the backbone network and can be applied to single-shot detectors with a structure of feature pyramid. Additionally, the instantiation of anchor-free branches and online feature selection can be various.",
  "title": "Feature Selective Anchor-Free Module for Single-Shot Object Detection",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Pix2Pix",
  "full_name": "Pix2Pix",
  "description": "**Pix2Pix** is a conditional image-to-image translation architecture that uses a conditional [GAN](https://paperswithcode.com/method/gan) objective combined with a reconstruction loss. The conditional GAN objective for observed images $x$, output images $y$ and the random noise vector $z$ is:\r\n\r\n$$ \\mathcal{L}\\_{cGAN}\\left(G, D\\right) =\\mathbb{E}\\_{x,y}\\left[\\log D\\left(x, y\\right)\\right]+\r\n\\mathbb{E}\\_{x,z}\\left[log(1 − D\\left(x, G\\left(x, z\\right)\\right)\\right] $$\r\n\r\nWe augment this with a reconstruction term:\r\n\r\n$$ \\mathcal{L}\\_{L1}\\left(G\\right) = \\mathbb{E}\\_{x,y,z}\\left[||y - G\\left(x, z\\right)||\\_{1}\\right] $$\r\n\r\nand we get the final objective as:\r\n\r\n$$ G^{*} = \\arg\\min\\_{G}\\max\\_{D}\\mathcal{L}\\_{cGAN}\\left(G, D\\right) + \\lambda\\mathcal{L}\\_{L1}\\left(G\\right) $$\r\n\r\nThe architectures employed for the generator and discriminator closely follow [DCGAN](https://paperswithcode.com/method/dcgan), with a few modifications:\r\n\r\n- Concatenated skip connections are used to \"shuttle\" low-level information between the input and output, similar to a [U-Net](https://paperswithcode.com/method/u-net).\r\n- The use of a [PatchGAN](https://paperswithcode.com/method/patchgan) discriminator that only penalizes structure at the scale of patches.",
  "title": "Image-to-Image Translation with Conditional Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Auxiliary Classifier",
  "full_name": "Auxiliary Classifier",
  "description": "**Auxiliary Classifiers** are type of architectural component that seek to improve the convergence of very deep networks. They are classifier heads we attach to layers before the end of the network. The motivation is to push useful gradients to the lower layers to make them immediately useful and improve the convergence during training by combatting the vanishing gradient problem. They are notably used in the Inception family of convolutional neural networks.",
  "title": null,
  "collection": "Miscellaneous Components",
  "area": "General"
}
{
  "name": "Neural Architecture Search",
  "full_name": "Neural Architecture Search",
  "description": "**Neural Architecture Search (NAS)** learns a modular architecture which can be transferred from a small dataset to a large dataset. The method does this by reducing the problem of learning best convolutional architectures to the problem of learning a small convolutional cell. The cell can then be stacked in series to handle larger images and more complex datasets.\r\n\r\nNote that this refers to the original method referred to as NAS - there is also a broader category of methods called \"neural architecture search\".",
  "title": "Learning Transferable Architectures for Scalable Image Recognition",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "Double Q-learning",
  "full_name": "Double Q-learning",
  "description": "**Double Q-learning** is an off-policy reinforcement learning algorithm that utilises double estimation to counteract overestimation problems with traditional Q-learning. \r\n\r\nThe max operator in standard [Q-learning](https://paperswithcode.com/method/q-learning) and [DQN](https://paperswithcode.com/method/dqn) uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation, which is the idea behind Double Q-learning:\r\n\r\n$$ Y^{Q}\\_{t} = R\\_{t+1} + \\gamma{Q}\\left(S\\_{t+1}, \\arg\\max\\_{a}Q\\left(S\\_{t+1}, a; \\mathbb{\\theta}\\_{t}\\right);\\mathbb{\\theta}\\_{t}\\right) $$\r\n\r\nThe Double Q-learning error can then be written as:\r\n\r\n$$ Y^{DoubleQ}\\_{t} = R\\_{t+1} + \\gamma{Q}\\left(S\\_{t+1}, \\arg\\max\\_{a}Q\\left(S\\_{t+1}, a; \\mathbb{\\theta}\\_{t}\\right);\\mathbb{\\theta}^{'}\\_{t}\\right) $$\r\n\r\nHere the selection of the action in the $\\arg\\max$ is still due to the online weights $\\theta\\_{t}$. But we use a second set of weights $\\mathbb{\\theta}^{'}\\_{t}$ to fairly evaluate the value of this policy.\r\n\r\nSource: [Deep Reinforcement Learning with Double Q-learning](https://paperswithcode.com/paper/deep-reinforcement-learning-with-double-q)",
  "title": "Double Q-learning",
  "collection": "Off-Policy TD Control",
  "area": "Reinforcement Learning"
}
{
  "name": "PipeDream-2BW",
  "full_name": "PipeDream-2BW",
  "description": "**PipeDream-2BW** is an asynchronous pipeline parallel method that supports memory-efficient pipeline parallelism, a hybrid form of parallelism that combines data and model parallelism with input pipelining. PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with the double buffering of weights, to ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism. In addition, PipeDream2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as memory capacities of accelerators, and topologies and bandwidths of interconnects. PipeDream-2BW also determines when to employ existing memory-savings techniques, such as activation recomputation, that trade off extra computation for lower memory footprint.\r\n\r\nThe two main features are a double-buffered weight update (2BW) and flush mechanisms ensure high throughput. PipeDream-2BW\r\nsplits models into stages over multiple workers, and each stage is replicated an equal number of times (with data-parallel updates across replicas of the same stage).  Such parallel pipelines work well for models where each layer is repeated a fixed number of times (e.g., [transformer](https://paperswithcode.com/method/transformer) models).",
  "title": "Memory-Efficient Pipeline-Parallel DNN Training",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "TaxoExpan",
  "full_name": "TaxoExpan",
  "description": "**TaxoExpan** is a self-supervised taxonomy expansion framework. It automatically generates a set of <query concept, anchor concept> pairs from the existing taxonomy as training data. Using such self-supervision data, TaxoExpan learns a model to predict whether a query concept is the direct hyponym of an anchor concept. TaxoExpan features: (1) a position-enhanced graph neural network that encodes the local structure of an anchor concept in the existing taxonomy, and (2) a noise-robust training objective that enables the learned model to be insensitive to the label noise in the self-supervision data.",
  "title": "TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced Graph Neural Network",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "YOLOX",
  "full_name": "YOLOX",
  "description": "**YOLOX** is a single-stage object detector that makes several modifications to [YOLOv3](https://paperswithcode.com/method/yolov3) with a  [DarkNet53](https://www.paperswithcode.com/method/darknet53) backbone. Specifically, YOLO’s head is replaced with a decoupled one. For each level of [FPN](https://paperswithcode.com/method/fpn) feature, we first adopt a 1 × 1 conv layer to reduce the feature channel to 256 and then add two parallel branches with two 3 × 3 conv layers each for classification and regression tasks respectively.\r\n\r\nAdditional changes include adding Mosaic and [MixUp](https://paperswithcode.com/method/mixup) into the augmentation strategies to boost YOLOX’s performance. The anchor mechanism is also removed so YOLOX is anchor-free. Lastly, SimOTA for label assignment  -- where label assignment is formulated as an optimal transport problem via a top-k strategy.",
  "title": "YOLOX: Exceeding YOLO Series in 2021",
  "collection": "One-Stage Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Feature Intertwiner",
  "full_name": "Feature Intertwiner",
  "description": "**Feature Intertwiner** is an object detection module that leverages the features from a more reliable set to help guide the feature learning of another less reliable set. The mutual learning process helps two sets to have closer distance within the cluster in each class. The intertwiner is applied on the object detection task, where a historical buffer is proposed to address the sample missing problem during one mini-batch and the optimal transport (OT) theory is introduced to enforce the similarity among the two sets.",
  "title": "Feature Intertwiner for Object Detection",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Cluster-GCN",
  "full_name": "Cluster-GCN",
  "description": "Cluster-GCN is a novel GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as the following: at each step, it samples a block of nodes that associate with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search within this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while being able to achieve comparable test accuracy with previous algorithms.\r\n\r\nDescription and image from: [Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks](https://arxiv.org/pdf/1905.07953.pdf)",
  "title": "Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "gMLP",
  "full_name": "gMLP",
  "description": "**gMLP** is an [MLP](https://paperswithcode.com/methods/category/feedforward-networks)-based alternative to [Transformers](https://paperswithcode.com/methods/category/vision-transformer) without [self-attention](https://paperswithcode.com/method/scaled), which simply consists of channel projections and spatial projections with static parameterization. It is built out of basic MLP layers with gating. The model consists of a stack of $L$ blocks with identical size and structure. Let $X \\in \\mathbb{R}^{n \\times d}$ be the token representations with sequence length $n$ and dimension $d$. Each block is defined as:\r\n\r\n$$\r\nZ=\\sigma(X U), \\quad \\tilde{Z}=s(Z), \\quad Y=\\tilde{Z} V\r\n$$\r\n\r\nwhere $\\sigma$ is an activation function such as [GeLU](https://paperswithcode.com/method/gelu). $U$ and $V$ define linear projections along the channel dimension - the same as those in the FFNs of Transformers (e.g., their shapes are $768 \\times 3072$ and $3072 \\times 768$ for $\\text{BERT}_{\\text {base }}$).\r\n\r\nA key ingredient is $s(\\cdot)$, a layer which captures spatial interactions. When $s$ is an identity mapping, the above transformation degenerates to a regular FFN, where individual tokens are processed independently without any cross-token communication. One of the major focuses is therefore to design a good $s$ capable of capturing complex spatial interactions across tokens. This leads to the use of a [Spatial Gating Unit](https://www.paperswithcode.com/method/spatial-gating-unit) which involves a modified linear gating.\r\n\r\nThe overall block layout is inspired by [inverted bottlenecks](https://paperswithcode.com/method/inverted-residual-block), which define $s(\\cdot)$ as a [spatial depthwise convolution](https://paperswithcode.com/method/depthwise-separable-convolution). Note, unlike Transformers, gMLP does not require position embeddings because such information will be captured in $s(\\cdot)$.",
  "title": "Pay Attention to MLPs",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "STTP",
  "full_name": "Spectral Tensor Train Parameterization",
  "description": "",
  "title": "Spectral Tensor Train Parameterization of Deep Learning Layers",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "ArcFace",
  "full_name": "Additive Angular Margin Loss",
  "description": "**ArcFace**, or **Additive Angular Margin Loss**, is a loss function used in face recognition tasks. The [softmax](https://paperswithcode.com/method/softmax) is traditionally used in these tasks. However, the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intraclass samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations. \r\n\r\nThe ArcFace loss transforms the logits $W^{T}\\_{j}x\\_{i} = || W\\_{j} || \\text{ } || x\\_{i} || \\cos\\theta\\_{j}$,\r\nwhere $\\theta\\_{j}$ is the angle between the weight $W\\_{j}$ and the feature $x\\_{i}$. The individual weight $ || W\\_{j} || = 1$ is fixed by $l\\_{2}$ normalization. The embedding feature $ ||x\\_{i} ||$ is fixed by $l\\_{2}$ normalization and re-scaled to $s$. The normalisation step on features and weights makes the predictions only depend on the angle between the feature and the weight. The learned embedding\r\nfeatures are thus distributed on a hypersphere with a radius of $s$. Finally, an additive angular margin penalty $m$ is added between $x\\_{i}$ and $W\\_{y\\_{i}}$ to simultaneously enhance the intra-class compactness and inter-class discrepancy. Since the proposed additive angular margin penalty is\r\nequal to the geodesic distance margin penalty in the normalised hypersphere, the method is named ArcFace:\r\n\r\n$$ L\\_{3} = -\\frac{1}{N}\\sum^{N}\\_{i=1}\\log\\frac{e^{s\\left(\\cos\\left(\\theta\\_{y\\_{i}} + m\\right)\\right)}}{e^{s\\left(\\cos\\left(\\theta\\_{y\\_{i}} + m\\right)\\right)} + \\sum^{n}\\_{j=1, j \\neq y\\_{i}}e^{s\\cos\\theta\\_{j}}} $$\r\n\r\nThe authors select face images from 8 different identities containing enough samples (around 1,500 images/class) to train 2-D feature embedding networks with the softmax and ArcFace loss, respectively. 
As the Figure shows, the softmax loss provides roughly separable feature embedding\r\nbut produces noticeable ambiguity in decision boundaries, while the proposed ArcFace loss can obviously enforce a more evident gap between the nearest classes.\r\n\r\nOther alternatives to enforce intra-class compactness and inter-class distance include [Supervised Contrastive Learning](https://arxiv.org/abs/2004.11362).",
  "title": "ArcFace: Additive Angular Margin Loss for Deep Face Recognition",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "E-Branchformer",
  "full_name": "E-Branchformer",
  "description": "E-BRANCHFORMER: BRANCHFORMER WITH ENHANCED MERGING FOR SPEECH RECOGNITION",
  "title": "E-Branchformer: Branchformer with Enhanced merging for speech recognition",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "VarImpVIANN",
  "full_name": "Variance-based Feature Importance of Artificial Neural Networks",
  "description": "",
  "title": "Variance-Based Feature Importance in Neural Networks",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "RUN",
  "full_name": "Rung Kutta optimization",
  "description": "The optimization field suffers from the metaphor-based “pseudo-novel” or “fancy” optimizers. Most of these cliché methods mimic animals' searching trends and possess a small contribution to the optimization process itself. Most of these cliché methods suffer from the locally efficient performance, biased verification methods on easy problems, and high similarity between their components' interactions. This study attempts to go beyond the traps of metaphors and introduce a novel metaphor-free population-based optimization method based on the mathematical foundations and ideas of the Runge Kutta (RK) method widely well-known in mathematics. The proposed RUNge Kutta optimizer (RUN) was developed to deal with various types of optimization problems in the future. The RUN utilizes the logic of slope variations computed by the RK method as a promising and logical searching mechanism for global optimization. This search mechanism benefits from two active exploration and exploitation phases for exploring the promising regions in the feature space and constructive movement toward the global best solution. Furthermore, an enhanced solution quality (ESQ) mechanism is employed to avoid the local optimal solutions and increase convergence speed. The RUN algorithm's efficiency was evaluated by comparing with other metaheuristic algorithms in 50 mathematical test functions and four real-world engineering problems. The RUN provided very promising and competitive results, showing superior exploration and exploitation tendencies, fast convergence rate, and local optima avoidance. In optimizing the constrained engineering problems, the metaphor-free RUN demonstrated its suitable performance as well. The authors invite the community for extensive evaluations of this deep-rooted optimizer as a promising tool for real-world optimization",
  "title": null,
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "InternVideo",
  "full_name": "InternVideo: General Video Foundation Models via Generative and Discriminative Learning",
  "description": "The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adpation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.",
  "title": null,
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Policy Similarity Metric",
  "full_name": "Policy Similarity Metric",
  "description": "**Policy Similarity Metric**, or **PSM**, is a similarity metric for measuring behavioral similarity between states in reinforcement learning. It assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. PSM is reward-agnostic, making it more robust for generalization compared to approaches that rely on reward information.",
  "title": "Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning",
  "collection": "State Similarity Metrics",
  "area": "Reinforcement Learning"
}
{
  "name": "PP-OCR",
  "full_name": "PP-OCR",
  "description": "**PP-OCR** is an OCR system that consists of three parts, text detection, detected boxes rectification and text recognition. The purpose of text detection is to locate the text area in the image. In PP-OCR, Differentiable Binarization (DB) is used as text detector which is based on a simple segmentation network. It integrates feature extraction and sequence modeling. It adopts the Connectionist Temporal Classification (CTC) loss to avoid the inconsistency between prediction and label.",
  "title": "PP-OCR: A Practical Ultra Lightweight OCR System",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "NNCLR",
  "full_name": "Nearest-Neighbor Contrastive Learning of Visual Representations",
  "description": "",
  "title": "With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "ShuffleNet Block",
  "full_name": "ShuffleNet Block",
  "description": "A **ShuffleNet Block** is an image model block that utilises a [channel shuffle](https://paperswithcode.com/method/channel-shuffle) operation, along with depthwise convolutions, for an efficient architectural design. It was proposed as part of the [ShuffleNet](https://paperswithcode.com/method/shufflenet) architecture. The starting point is the [Residual Block](https://paperswithcode.com/method/residual-block) unit from [ResNets](https://paperswithcode.com/method/resnet), which is then modified with a pointwise group [convolution](https://paperswithcode.com/method/convolution) and a channel shuffle operation.",
  "title": "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "node2vec",
  "full_name": "node2vec",
  "description": "**node2vec** is a framework for learning graph embeddings for nodes in graphs. Node2vec maximizes a likelihood objective over mappings which preserve neighbourhood distances in higher dimensional spaces. From an algorithm design perspective, node2vec exploits the freedom to define neighbourhoods for nodes and provide an explanation for the effect of the choice of neighborhood on the learned representations. \r\n\r\nFor each node, node2vec simulates biased random walks based on an efficient network-aware search strategy and the nodes appearing in the random walk define neighbourhoods. The search strategy accounts for the relative influence nodes exert in a network. It also generalizes prior work alluding to naive search strategies by providing flexibility in exploring neighborhoods.",
  "title": "node2vec: Scalable Feature Learning for Networks",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Dot-Product Attention",
  "full_name": "Dot-Product Attention",
  "description": "**Dot-Product Attention** is an attention mechanism where the alignment score function is calculated as: \r\n\r\n$$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = h\\_{i}^{T}s\\_{j}$$\r\n\r\nIt is equivalent to [multiplicative attention](https://paperswithcode.com/method/multiplicative-attention) (without a trainable weight matrix, assuming this is instead an identity matrix). Here $\\textbf{h}$ refers to the hidden states for the encoder, and $\\textbf{s}$ is the hidden states for the decoder. The function above is thus a type of alignment score function. \r\n\r\nWithin a neural network, once we have the alignment scores, we calculate the final scores/weights using a [softmax](https://paperswithcode.com/method/softmax) function of these alignment scores (ensuring it sums to 1).",
  "title": "Effective Approaches to Attention-based Neural Machine Translation",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "FINCH Clustering",
  "full_name": "First Integer Neighbor Clustering Hierarchy (FINCH))",
  "description": "FINCH is a parameter-free fast and scalable clustering algorithm. it stands out for its speed and clustering quality.",
  "title": "Efficient Parameter-Free Clustering Using First Neighbor Relations",
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "CDCC-NET",
  "full_name": "CDCC-NET",
  "description": "CDCC-NET is a multi-task network that analyzes the detected counter region and predicts 9 outputs: eight float numbers referring to the corner positions (x0/w, y0/h, ... , x3/w, y3/h) and an array containing two float numbers regarding the probability of the counter being legible/operational or illegible/faulty.",
  "title": "Towards Image-based Automatic Meter Reading in Unconstrained Scenarios: A Robust and Efficient Approach",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "VPSNet",
  "full_name": "Video Panoptic Segmentation Network",
  "description": "**Video Panoptic Segmentation Network**, or **VPSNet**, is a model for video panoptic segmentation. On top of UPSNet, which is a method for image panoptic segmentation, VPSNet is designed to take an additional frame as the reference to correlate time information at two levels: pixel-level fusion and object-level tracking. To pick up the complementary feature points in the reference frame, a flow-based feature map alignment module is introduced along with an asymmetric attention block that computes similarities between the target and reference features to fuse them into one-frame shape. Additionally, to associate object instances across time, \r\n an object track head is added which learns the correspondence between the instances in the target and reference frames based\r\non their RoI feature similarity.",
  "title": "Video Panoptic Segmentation",
  "collection": "Video Panoptic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "SEED RL",
  "full_name": "SEED RL",
  "description": "**SEED** (Scalable, Efficient, Deep-RL) is a scalable reinforcement learning agent. It utilizes an architecture that features centralized inference and an optimized communication layer. SEED adopts two state of the art distributed algorithms, [IMPALA](https://paperswithcode.com/method/impala)/[V-trace](https://paperswithcode.com/method/v-trace) (policy gradients) and R2D2 ([Q-learning](https://paperswithcode.com/method/q-learning)).",
  "title": "SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "OA-Loss",
  "full_name": "Object-Aware Loss",
  "description": "**OA-Loss** reduces the domain gap between the original and augmented images. OA-Loss enables the model to learn semantic relations of foreground and background instances from multi-domain.",
  "title": "Object-Aware Domain Generalization for Object Detection",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Bottleneck Transformer Block",
  "full_name": "Bottleneck Transformer Block",
  "description": "A **Bottleneck Transformer Block** is a block used in [Bottleneck Transformers](https://www.paperswithcode.com/method/bottleneck-transformer) that replaces the spatial 3 × 3 [convolution](https://paperswithcode.com/method/convolution) layer in a [Residual Block](https://paperswithcode.com/method/residual-block) with Multi-Head Self-Attention (MHSA).",
  "title": "Bottleneck Transformers for Visual Recognition",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Lbl2Vec",
  "full_name": "Lbl2Vec",
  "description": "",
  "title": "Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics",
  "collection": "Text Classification Models",
  "area": "Natural Language Processing"
}
{
  "name": "AlterNet",
  "full_name": "AlterNet",
  "description": "",
  "title": "How Do Vision Transformers Work?",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "MPNet",
  "full_name": "MPNet",
  "description": "**MPNet** is a pre-training method for language models that combines masked language modeling (MLM) and permuted language modeling (PLM) in one view. It takes the dependency among the predicted tokens into consideration through permuted language modeling and thus avoids the issue of [BERT](https://paperswithcode.com/method/bert). On the other hand, it takes position information of all tokens as input to make the model see the position information of all the tokens and thus alleviates the position discrepancy of [XLNet](https://paperswithcode.com/method/xlnet).\r\n\r\nThe training objective of MPNet is:\r\n\r\n$$ \\mathbb{E}\\_{z\\in{\\mathcal{Z}\\_{n}}} \\sum^{n}\\_{t=c+1}\\log{P}\\left(x\\_{z\\_{t}}\\mid{x\\_{z\\_{<t}}}, M\\_{z\\_{{>}{c}}}; \\theta\\right) $$\r\n\r\nAs can be seen, MPNet conditions on ${x\\_{z\\_{<t}}}$ (the tokens preceding the current predicted token $x\\_{z\\_{t}}$) rather than only the non-predicted tokens ${x\\_{z\\_{<=c}}}$ in MLM; comparing with PLM, MPNet takes more information (i.e., the mask symbol $[M]$ in position $z\\_{>c}$) as inputs. Although the objective seems simple, it is challenging to implement the model efficiently. For details, see the paper.",
  "title": "MPNet: Masked and Permuted Pre-training for Language Understanding",
  "collection": "Language Model Pre-Training",
  "area": "Natural Language Processing"
}
{
  "name": "GER",
  "full_name": "Gait Emotion Recognition",
  "description": "We present a novel classifier network called STEP, to classify perceived human emotion from gaits, based on a Spatial Temporal Graph Convolutional Network (ST-[GCN](https://paperswithcode.com/method/gcn)) architecture. Given an RGB video of an individual walking, our formulation implicitly exploits the gait features to classify the perceived emotion of the human into one of four emotions: happy, sad, angry, or neutral. We train STEP on annotated real-world gait videos, augmented with annotated synthetic gaits generated using a novel generative network called STEP-Gen, built on an ST-GCN based Conditional Variational Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the CVAE formulation of STEP-Gen to generate realistic gaits and improve the classification accuracy of STEP.\r\nWe also release a novel dataset (E-Gait), which consists of 4,227 human gaits annotated with perceived emotions along with thousands of synthetic gaits. In practice, STEP can learn the affective features and exhibits classification accuracy of 88\\% on E-Gait, which is 14--30\\% more accurate over prior methods.",
  "title": "STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits",
  "collection": null,
  "area": null
}
{
  "name": "GSoP-Net",
  "full_name": "Global second-order pooling convolutional networks",
  "description": "A Gsop block has a squeeze module and an excitation module, and uses a second-order pooling to model high-order statistics while gathering global information.\r\nIn the squeeze module, a GSoP block firstly reduces the number of channels from $c$ to $c'$ ($c' < c$) using a $1 \\times 1$ convolution,  then  computes a $c' \\times c'$ covariance matrix for the different channels to obtain their correlation.  Next, row-wise normalization is performed on the covariance matrix.  Each $(i, j)$ in the normalized covariance matrix explicitly relates channel $i$ to channel $j$. \r\n\r\nIn the excitation module, a GSoP block performs row-wise convolution to  maintain structural information and output a vector. Then a fully-connected layer and a sigmoid function are applied  to get a $c$-dimensional attention vector. Finally, it multiplies the input features by the attention vector, as in an SE block. A GSoP block can be formulated as:\r\n\\begin{align}\r\n    s = F_\\text{gsop}(X, \\theta) & = \\sigma (W \\text{RC}(\\text{Cov}(\\text{Conv}(X))))\r\n\\end{align}\r\n\\begin{align}\r\n    Y & = s  X\r\n\\end{align}\r\nHere, $\\text{Conv}(\\cdot)$ reduces the number of channels,\r\n$\\text{Cov}(\\cdot)$ computes the covariance matrix and\r\n$\\text{RC}(\\cdot)$ means row-wise convolution.",
  "title": "Global Second-order Pooling Convolutional Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "RFB",
  "full_name": "Receptive Field Block",
  "description": "**Receptive Field Block (RFB)** is a module for strengthening the deep features learned from lightweight CNN models so that they can contribute to fast and accurate detectors. Specifically, RFB makes use of multi-branch pooling with varying kernels corresponding to RFs of different sizes, applies [dilated convolution](https://paperswithcode.com/method/dilated-convolution) layers to control their eccentricities, and reshapes them to generate\r\nfinal representation.",
  "title": "Receptive Field Block Net for Accurate and Fast Object Detection",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Wide Residual Block",
  "full_name": "Wide Residual Block",
  "description": "A **Wide Residual Block** is a type of [residual block](https://paperswithcode.com/method/residual-block) that utilises two conv 3x3 layers (with [dropout](https://paperswithcode.com/method/dropout)). This is wider than other variants of residual blocks (for instance [bottleneck residual blocks](https://paperswithcode.com/method/bottleneck-residual-block)). It was proposed as part of the [WideResNet](https://paperswithcode.com/method/wideresnet) CNN architecture.",
  "title": "Wide Residual Networks",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "Dropout",
  "full_name": "Dropout",
  "description": "**Dropout** is a regularization technique for neural networks that drops a unit (along with connections) at training time with a specified probability $p$ (a common value is $p=0.5$). At test time, all units are present, but with weights scaled by $p$ (i.e. $w$ becomes $pw$).\r\n\r\nThe idea is to prevent co-adaptation, where the neural network becomes too reliant on particular connections, as this could be symptomatic of overfitting. Intuitively, dropout can be thought of as creating an implicit ensemble of neural networks.",
  "title": "Dropout: A Simple Way to Prevent Neural Networks from Overfitting",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "M3L",
  "full_name": "Multi-modal Teacher for Masked Modality Learning",
  "description": "",
  "title": "Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "TayPO",
  "full_name": "Taylor Expansion Policy Optimization",
  "description": "**TayPO**, or **Taylor Expansion Policy Optimization**, refers to a set of algorithms that apply the $k$-th order Taylor expansions for policy optimization. This generalizes prior work, including [TRPO](https://paperswithcode.com/method/trpo) as a special case. It can be thought of unifying ideas from trust-region policy optimization and off-policy corrections. Taylor expansions share high-level similarities with both trust region policy search and off-policy corrections. To get high-level intuitions of such similarities, consider a simple 1D example of Taylor expansions. Given a sufficiently smooth real-valued function on the real line $f : \\mathbb{R} \\rightarrow \\mathbb{R}$, the $k$-th order Taylor expansion of $f\\left(x\\right)$ at $x\\_{0}$ is \r\n\r\n$$f\\_{k}\\left(x\\right) = f\\left(x\\_{0}\\right)+\\sum^{k}\\_{i=1}\\left[f^{(i)}\\left(x\\_{0}\\right)/i!\\right]\\left(x−x\\_{0}\\right)^{i}$$\r\n\r\nwhere $f^{(i)}\\left(x\\_{0}\\right)$ are the $i$-th order derivatives at $x\\_{0}$. First, a common feature shared by Taylor expansions and trust-region policy search is the inherent notion of a trust region constraint. Indeed, in order for convergence to take place, a trust-region constraint is required $|x − x\\_{0}| < R\\left(f, x\\_{0}\\right)^{1}$. Second, when using the truncation as an approximation to the original function $f\\_{K}\\left(x\\right) \\approx f\\left(x\\right)$, Taylor expansions satisfy the requirement of off-policy evaluations: evaluate target policy with behavior data. Indeed, to evaluate the truncation $f\\_{K}\\left(x\\right)$ at any $x$ (target policy), we only require the behavior policy \"data\" at $x\\_{0}$ (i.e., derivatives $f^{(i)}\\left(x\\_{0}\\right)$).",
  "title": "Taylor Expansion Policy Optimization",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "StyleGAN",
  "full_name": "StyleGAN",
  "description": "**StyleGAN** is a type of generative adversarial network. It uses an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature; in particular, the use of [adaptive instance normalization](https://paperswithcode.com/method/adaptive-instance-normalization). Otherwise it follows Progressive [GAN](https://paperswithcode.com/method/gan) in using a progressively growing training regime. Other quirks include the fact it generates from a fixed value tensor not stochastically generated latent variables as in regular GANs. The stochastically generated latent variables are used as style vectors in the adaptive [instance normalization](https://paperswithcode.com/method/instance-normalization) at each resolution after being transformed by an 8-layer [feedforward network](https://paperswithcode.com/method/feedforward-network). Lastly, it employs a form of regularization called mixing regularization, which mixes two style latent variables during training.",
  "title": "A Style-Based Generator Architecture for Generative Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "LayoutReader",
  "full_name": "LayoutReader",
  "description": "** LayoutReader** is a sequence-to-sequence model for reading order detection that uses both textual and layout information, where the layout-aware language model [LayoutLM](https://paperswithcode.com/method/layoutlmv2) is leveraged as an encoder. The generation step in the encoder-decoder structure tis modified to generate the reading order sequence.\r\n\r\nIn the encoding stage, LayoutReader packs the pair of source and target segments into a contiguous input sequence of LayoutLM and carefully designs the [self-attention mask](https://paperswithcode.com/methods/category/factorized-attention) to control the visibility between tokens. As shown in the Figure, LayoutReader allows the tokens in the source segment to attend to each other while preventing the tokens in the target segment from attending to the rightward context. If 1 means allowing and 0 means preventing, the detail of the mask $M$ is as follows:\r\n\r\n$$ M\\_{i, j}= \\begin{cases}1, & \\text { if } i<j \\text { or } i, j \\in \\operatorname{src} \\\\ 0, & \\text { otherwise }\\end{cases} $$\r\n\r\nwhere $i, j$ are the indices in the packed input sequence, so they may be from source or target segments; $i, j \\in$ src means both tokens are from source segment.\r\n\r\nIn the decoding stage, since the source and target are reordered sequences, the prediction candidates can be constrained to the source segment. Therefore, we ask the model to predict the indices in the source sequence. The probability is calculated as follows:\r\n\r\n$$\r\n\\mathcal{P}\\left(x_{k}=i \\mid x_{<k}\\right)=\\frac{\\exp \\left(e_{i}^{T} h\\_{k}+b\\_{k}\\right)}{\\sum_{j} \\exp \\left(e\\_{j}^{T} h_{k}+b\\_{k}\\right)}\r\n$$\r\n\r\nwhere $i$ is an index in the source segment; $e\\_{i}$ and $e\\_{j}$ are the $\\mathrm{i}$-th and $\\mathrm{j}$-th input embeddings of the source segment; $h\\_{k}$ is the hidden states at the $\\mathrm{k}$-th time step; $b\\_{k}$ is the bias at the $\\mathrm{k}$-th time step.",
  "title": "LayoutReader: Pre-training of Text and Layout for Reading Order Detection",
  "collection": "Sequence To Sequence Models",
  "area": "Sequential"
}
{
  "name": "Depthwise Fire Module",
  "full_name": "Depthwise Fire Module",
  "description": "A **Depthwise Fire Module** is a modification of a [Fire Module](https://paperswithcode.com/method/fire-module) with depthwise separable convolutions to improve the inference time performance. It is used in the [CornerNet](https://paperswithcode.com/method/cornernet)-Lite architecture for object detection.",
  "title": "CornerNet-Lite: Efficient Keypoint Based Object Detection",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "RevNet",
  "full_name": "RevNet",
  "description": "A **Reversible Residual Network**, or **RevNet**, is a variant of a [ResNet](https://paperswithcode.com/method/resnet) where each layer’s activations can be reconstructed exactly from the next layer’s. Therefore, the activations for most layers need not be stored in memory during backpropagation. The result is a network architecture whose activation storage requirements are independent of depth, and typically at least an order of magnitude smaller compared with equally sized ResNets.\r\n\r\nRevNets are composed of a series of reversible blocks. Units in each layer are partitioned into two groups, denoted $x\\_{1}$ and $x\\_{2}$; the authors find what works best is partitioning the channels. Each reversible block takes inputs $\\left(x\\_{1}, x\\_{2}\\right)$ and produces outputs $\\left(y\\_{1}, y\\_{2}\\right)$ according to the following additive coupling rules – inspired the transformation in [NICE](https://paperswithcode.com/method/nice) (nonlinear independent components estimation) – and residual functions $F$ and $G$ analogous to those in standard ResNets:\r\n\r\n$$y\\_{1} = x\\_{1} + F\\left(x\\_{2}\\right)$$\r\n$$y\\_{2} = x\\_{2} + G\\left(y\\_{1}\\right)$$\r\n\r\nEach layer’s activations can be reconstructed from the next layer’s activations as follows:\r\n\r\n$$ x\\_{2} = y\\_{2} − G\\left(y\\_{1}\\right)$$\r\n$$ x\\_{1} = y\\_{1} − F\\left(x\\_{2}\\right)$$\r\n\r\nNote that unlike residual blocks, reversible blocks must have a stride of 1 because otherwise the layer\r\ndiscards information, and therefore cannot be reversible. Standard ResNet architectures typically\r\nhave a handful of layers with a larger stride. If we define a RevNet architecture analogously, the\r\nactivations must be stored explicitly for all non-reversible layers.",
  "title": "The Reversible Residual Network: Backpropagation Without Storing Activations",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Bilateral Guided Aggregation Layer",
  "full_name": "Bilateral Guided Aggregation Layer",
  "description": "**Bilateral Guided Aggregation Layer** is a feature fusion layer for semantic segmentation that aims to enhance mutual connections and fuse different types of feature representation. It was used in the [BiSeNet V2](https://paperswithcode.com/method/bisenet-v2) architecture. Specifically, within the BiSeNet implementation, the layer was used to employ the contextual information of the Semantic Branch to guide the feature response of Detail Branch. With different scale guidance, different scale feature representations can be captured, which inherently encodes the multi-scale information.",
  "title": "BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation",
  "collection": "Semantic Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "Levenshtein Transformer",
  "full_name": "Levenshtein Transformer",
  "description": "The **Levenshtein Transformer** (LevT) is a type of [transformer](https://paperswithcode.com/method/transformer) that aims to address the lack of flexibility of previous decoding models. Notably, in previous frameworks, the length of generated sequences is either fixed or monotonically increased as the decoding proceeds. The authors argue this is incompatible with human-level intelligence where humans can revise, replace, revoke or delete any part of their generated text. Hence, LevT is proposed to bridge this gap by breaking the in-so-far standardized decoding mechanism and replacing it with two basic operations — insertion and deletion.\r\n\r\nLevT is trained using imitation learning. The resulted model contains two policies and they are executed in an alternate manner. The authors argue that with this model decoding becomes more flexible. For example, when the decoder is given an empty token, it falls back to a normal sequence generation model. On the other hand, the decoder acts as a refinement model when the initial state is a low-quality generated sequence.\r\n\r\nOne crucial component in LevT framework is the learning algorithm. The authors leverage the characteristics of insertion and deletion — they are complementary but also adversarial. The algorithm they propose is called “dual policy learning”. The idea is that when training one policy (insertion or deletion), we use the output from its adversary at the previous iteration as input. An expert policy, on the other hand, is drawn to provide a correction signal.",
  "title": "Levenshtein Transformer",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Morphence",
  "full_name": "Morphence",
  "description": "**Morphence** is an approach for adversarial defense that shifts the defense landscape by making a model a moving target against adversarial examples. By regularly moving the decision function of a model, Morphence makes it significantly challenging for repeated or correlated attacks to succeed. Morphence deploys a pool of models generated from a base model in a manner that introduces sufficient randomness when it responds to prediction queries. To ensure repeated or correlated attacks fail, the deployed pool of models automatically expires after a query budget is reached and the model pool is replaced by a new model pool generated in advance.",
  "title": "Morphence: Moving Target Defense Against Adversarial Examples",
  "collection": "Adversarial Attacks",
  "area": "General"
}
{
  "name": "DV3 Attention Block",
  "full_name": "DV3 Attention Block",
  "description": "**DV3 Attention Block** is an attention-based module used in the [Deep Voice 3](https://paperswithcode.com/method/deep-voice-3) architecture. It uses a [dot-product attention](https://paperswithcode.com/method/dot-product-attention) mechanism. A query vector (the hidden states of the decoder) and the per-timestep key vectors from the encoder are used to compute attention weights. This then outputs a context vector computed as the weighted average of the value vectors.",
  "title": "Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "COP-KMeans",
  "full_name": "Constrained Pairwise k-Means",
  "description": "COP-KMeans is a modified version the popular k-means algorithm that supports pairwise constraints. \r\n\r\nOriginal paper : Constrained K-means Clustering with Background Knowledge, Wagstaff et al. 2001",
  "title": null,
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "DetNAS",
  "full_name": "DetNAS",
  "description": "**DetNAS** is a [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) algorithm for the design of better backbones for object detection. It is based on the technique of one-shot supernet, which contains all possible networks in the search space. The supernet is trained under the typical detector training schedule: ImageNet pre-training and detection fine-tuning. Then, the architecture search is performed on the trained supernet, using the detection task as the guidance. DetNAS uses evolutionary search as opposed to RL-based methods or gradient-based methods.",
  "title": "DetNAS: Backbone Search for Object Detection",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "Mix-FFN",
  "full_name": "Mix-FFN",
  "description": "**Mix-FFN** is a feedforward layer used in the [SegFormer](https://paperswithcode.com/method/segformer) architecture. [ViT](https://www.paperswithcode.com/method/vision-transformer) uses [positional encoding](https://paperswithcode.com/methods/category/position-embeddings) (PE) to introduce the location information. However, the resolution of $\\mathrm{PE}$ is fixed. Therefore, when the test resolution is different from the training one, the positional code needs to be interpolated and this often leads to dropped accuracy. To alleviate this problem, [CPVT](https://www.paperswithcode.com/method/cpvt) uses $3 \\times 3$ Conv together with the PE to implement a data-driven PE. The authors of Mix-FFN argue that positional encoding is actually not necessary for semantic segmentation. Instead, they use Mix-FFN which considers the effect of zero padding to leak location information, by directly using a $3 \\times 3$ Conv in the feed-forward network (FFN). Mix-FFN can be formulated as:\r\n\r\n$$\r\n\\mathbf{x}\\_{\\text {out }}=\\operatorname{MLP}\\left(\\operatorname{GELU}\\left(\\operatorname{Conv}\\_{3 \\times 3}\\left(\\operatorname{MLP}\\left(\\mathbf{x}\\_{i n}\\right)\\right)\\right)\\right)+\\mathbf{x}\\_{i n}\r\n$$\r\n\r\nwhere $\\mathbf{x}\\_{i n}$ is the feature from a self-attention module. Mix-FFN mixes a $3 \\times 3$ convolution and an MLP into each FFN.",
  "title": "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "BAM",
  "full_name": "Bottleneck Attention Module",
  "description": "Park et al. proposed the bottleneck attention module (BAM), aiming\r\nto efficiently improve the representational capability of networks. \r\nIt uses dilated convolution to enlarge the receptive field of the spatial attention sub-module, and build a bottleneck structure as suggested  by ResNet to save computational cost.\r\n\r\nFor a given input feature map $X$, BAM infers the channel attention $s_c \\in \\mathbb{R}^C$ and spatial attention $s_s\\in \\mathbb{R}^{H\\times W}$ in two parallel streams, then sums the two attention maps after resizing both branch outputs to $\\mathbb{R}^{C\\times H \\times W}$. The channel attention branch, like an SE block, applies global average pooling to the feature map to aggregate global information, and then uses an MLP with channel dimensionality reduction. In order to utilize contextual information effectively, the spatial attention branch combines a bottleneck structure and dilated convolutions. Overall, BAM can be written as\r\n\\begin{align}\r\n    s_c &= \\text{BN}(W_2(W_1\\text{GAP}(X)+b_1)+b_2)\r\n\\end{align}\r\n\r\n\\begin{align}\r\n    s_s &= BN(Conv_2^{1 \\times 1}(DC_2^{3\\times 3}(DC_1^{3 \\times 3}(Conv_1^{1 \\times 1}(X))))) \r\n\\end{align}\r\n\\begin{align}\r\n    s &= \\sigma(\\text{Expand}(s_s)+\\text{Expand}(s_c)) \r\n\\end{align}\r\n\\begin{align}\r\n    Y &= s X+X\r\n\\end{align}\r\nwhere $W_i$, $b_i$ denote  weights and biases of fully connected layers respectively, $Conv_{1}^{1\\times 1}$ and $Conv_{2}^{1\\times 1}$ are convolution layers  used for channel reduction. $DC_i^{3\\times 3}$ denotes a dilated convolution with $3\\times 3$ kernel,  applied to utilize contextual information effectively. $\\text{Expand}$ expands the attention maps $s_s$ and $s_c$ to $\\mathbb{R}^{C\\times H\\times W}$.\r\n\r\nBAM can emphasize or suppress features in both spatial and channel dimensions, as well as improving the representational power. 
Dimensional reduction applied to both channel and spatial attention branches enables it to be integrated with any convolutional neural network with little extra computational cost. However, although dilated convolutions enlarge the receptive field effectively, it still fails to capture long-range contextual information as well as encoding cross-domain relationships.",
  "title": "BAM: Bottleneck Attention Module",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "GCNFN",
  "full_name": "Graph Convolutional Networks for Fake News Detection",
  "description": "Social media are nowadays one of the main news sources for millions of people around the globe due to their low cost, easy access and rapid dissemination. This however comes at the cost of dubious trustworthiness and significant risk of exposure to 'fake news', intentionally written to mislead the readers. Automatically detecting fake news poses challenges that defy existing content-based analysis approaches. One of the main reasons is that often the interpretation of the news requires the knowledge of political or social context or 'common sense', which current NLP algorithms are still missing. Recent studies have shown that fake and real news spread differently on social media, forming propagation patterns that could be harnessed for the automatic fake news detection. Propagation-based approaches have multiple advantages compared to their content-based counterparts, among which is language independence and better resilience to adversarial attacks. In this paper we show a novel automatic fake news detection model based on geometric deep learning. The underlying core algorithms are a generalization of classical CNNs to graphs, allowing the fusion of heterogeneous data such as content, user profile and activity, social graph, and news propagation. Our model was trained and tested on news stories, verified by professional fact-checking organizations, that were spread on Twitter. Our experiments indicate that social network structure and propagation are important features allowing highly accurate (92.7% ROC AUC) fake news detection. Second, we observe that fake news can be reliably detected at an early stage, after just a few hours of propagation. Third, we test the aging of our model on training and testing data separated in time. Our results point to the promise of propagation-based approaches for fake news detection as an alternative or complementary strategy to content-based approaches.",
  "title": "Fake News Detection on Social Media using Geometric Deep Learning",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "AdaHessian",
  "full_name": "ADAHESSIAN",
  "description": "AdaHessian achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of [ADAM](https://paperswithcode.com/method/adam). In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that AdaHessian: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to ADAM; (ii) outperforms ADAMW for transformers by 0.27/0.33 BLEU score on IWSLT14/WMT14 and 1.8/1.0 PPL on PTB/Wikitext-103; and (iii) achieves 0.032% better score than [AdaGrad](https://paperswithcode.com/method/adagrad) for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of AdaHessian is comparable to first-order methods, and that it exhibits robustness towards its hyperparameters.",
  "title": "ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "PointAugment",
  "full_name": "PointAugment",
  "description": "**PointAugment** is a an auto-augmentation framework that automatically optimizes and augments point cloud samples to enrich the data diversity when we train a classification network. Different from existing auto-augmentation methods for 2D images, PointAugment is sample-aware and takes an adversarial learning strategy to jointly optimize an augmentor network and a classifier network, such that the augmentor can learn to produce augmented samples that best fit the classifier.",
  "title": "PointAugment: an Auto-Augmentation Framework for Point Cloud Classification",
  "collection": "Point Cloud Augmentation",
  "area": "Computer Vision"
}
{
  "name": "DFA",
  "full_name": "Direct Feedback Alignment",
  "description": "",
  "title": "Direct Feedback Alignment Provides Learning in Deep Neural Networks",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "GRoIE",
  "full_name": "Generic RoI Extractor",
  "description": "**GroIE** is an RoI extractor which intends to overcome the limitation of existing extractors which select only one (the best) layer from the [FPN](https://paperswithcode.com/method/fpn). The intuition is that all the layers of FPN retain useful\r\ninformation. Therefore, the proposed layer introduces non-local building blocks and attention mechanisms to boost the performance.",
  "title": "A novel Region of Interest Extraction Layer for Instance Segmentation",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "PaLM",
  "full_name": "Pathways Language Model",
  "description": "**PaLM** (**Pathways Language Model**) uses a standard Transformer model architecture (Vaswani et al., 2017) in a decoder-only setup (i.e., each timestep can only attend to itself and past timesteps), with several modifications. PaLM is trained as a 540 billion parameter, densely activated, autoregressive Transformer on 780 billion tokens. PaLM leverages Pathways (Barham et al., 2022), which enables highly efficient training of very large neural networks across thousands of accelerator chips.\r\n\r\nImage credit: [PaLM: Scaling Language Modeling with Pathways](https://paperswithcode.com/paper/palm-scaling-language-modeling-with-pathways-1)",
  "title": "PaLM: Scaling Language Modeling with Pathways",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "PrivacyNet",
  "full_name": "PrivacyNet",
  "description": "**PrivacyNet** is a [GAN](https://paperswithcode.com/method/gan)-based semi-adversarial network (SAN) that modifies an input face image such that it can be used by a face matcher for matching purposes but cannot be reliably used by an attribute classifier. PrivacyNet allows a person to choose specific attributes that have to be obfuscated in the input face images (e.g., age and race), while allowing for other types of attributes to be extracted (e.g., gender).",
  "title": "PrivacyNet: Semi-Adversarial Networks for Multi-attribute Face Privacy",
  "collection": "Face Privacy",
  "area": "Computer Vision"
}
{
  "name": "CLIPort",
  "full_name": "CLIPort",
  "description": "CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2].",
  "title": "CLIPort: What and Where Pathways for Robotic Manipulation",
  "collection": "Imitation Learning Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "VGGoptiVMD",
  "full_name": "VGG and variational Model Decomposition",
  "description": "",
  "title": "10 hours data is all you need",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Adapter",
  "full_name": "Adapter",
  "description": "",
  "title": "Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "WGAN-GP Loss",
  "full_name": "WGAN-GP Loss",
  "description": "**Wasserstein Gradient Penalty Loss**, or **WGAN-GP Loss**, is a loss used for generative adversarial networks that augments the Wasserstein loss with a gradient norm penalty for random samples $\\mathbf{\\hat{x}} \\sim \\mathbb{P}\\_{\\hat{\\mathbf{x}}}$ to achieve Lipschitz continuity:\r\n\r\n$$ L = \\mathbb{E}\\_{\\mathbf{\\hat{x}} \\sim \\mathbb{P}\\_{g}}\\left[D\\left(\\tilde{\\mathbf{x}}\\right)\\right] - \\mathbb{E}\\_{\\mathbf{x} \\sim \\mathbb{P}\\_{r}}\\left[D\\left(\\mathbf{x}\\right)\\right] + \\lambda\\mathbb{E}\\_{\\mathbf{\\hat{x}} \\sim \\mathbb{P}\\_{\\hat{\\mathbf{x}}}}\\left[\\left(||\\nabla\\_{\\tilde{\\mathbf{x}}}D\\left(\\mathbf{\\tilde{x}}\\right)||\\_{2}-1\\right)^{2}\\right]$$\r\n\r\nIt was introduced as part of the [WGAN-GP](https://paperswithcode.com/method/wgan-gp) overall model.",
  "title": "Improved Training of Wasserstein GANs",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "VoTr",
  "full_name": "Voxel Transformer",
  "description": "**VoTr** is a [Transformer](https://paperswithcode.com/method/transformer)-based 3D backbone for 3D object detection from point clouds. It contains a series of sparse and submanifold voxel modules. Submanifold voxel modules perform multi-head self-attention strictly on the non-empty voxels, while sparse voxel modules can extract voxel features at empty locations. Long-range relationships between voxels are captured via self-attention.\r\n\r\nGiven the fact that non-empty voxels are naturally sparse but numerous, directly applying standard Transformer on voxels is non-trivial. To this end, VoTr uses a sparse voxel module and a submanifold voxel module, which can operate on the empty and non-empty voxel positions effectively. To further enlarge the attention range while maintaining comparable computational overhead to the convolutional counterparts, two attention mechanisms are used for [multi-head attention](https://paperswithcode.com/method/multi-head-attention) in those two modules: Local Attention and Dilated Attention. Furthermore [Fast Voxel Query](https://paperswithcode.com/method/fast-voxel-query) is used to accelerate the querying process in multi-head attention.",
  "title": "Voxel Transformer for 3D Object Detection",
  "collection": "3D Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "NUQSGD",
  "full_name": "Nonuniform Quantization for Stochastic Gradient Descent",
  "description": "As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel [SGD](https://paperswithcode.com/method/sgd) is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees, however, for practical purposes, the authors proposed a heuristic variant which we call QSGDinf, which demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has both stronger theoretical guarantees than QSGD, and matches and exceeds the empirical performance of the QSGDinf heuristic and of other compression methods.",
  "title": "NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization",
  "collection": "Data Parallel Methods",
  "area": "General"
}
{
  "name": "SRU++",
  "full_name": "SRU++",
  "description": "**SRU++** is a self-attentive recurrent unit that combines fast recurrence and attention for sequence modeling, extending the [SRU](https://www.paperswithcode.com/method/sru) unit. The key modification of SRU++ is to incorporate more expressive non-linear operations into the recurrent network. Specifically, given the input sequence represented as a matrix $\\mathbf{X} \\in \\mathbb{R}^{L \\times d}$, the attention component computes the query, key and value representations using the following multiplications,\r\n\r\n$$\r\n\\mathbf{Q} =\\mathbf{W}^{q} \\mathbf{X}^{\\top} \r\n$$\r\n\r\n$$\r\n\\mathbf{K} =\\mathbf{W}^{k} \\mathbf{Q} \\\\\r\n$$\r\n\r\n$$\r\n\\mathbf{V} =\\mathbf{W}^{v} \\mathbf{Q}\r\n$$\r\n\r\nwhere $\\mathbf{W}^{q} \\in \\mathbb{R}^{d^{\\prime} \\times d}, \\mathbf{W}^{k}, \\mathbf{W}^{v} \\in \\mathbb{R}^{d^{\\prime} \\times d^{\\prime}}$ are model parameters. $d^{\\prime}$ is the attention dimension that is typically much smaller than $d$. Note that the keys $\\mathbf{K}$ and values $\\mathbf{V}$ are computed using $\\mathbf{Q}$ instead of $\\mathbf{X}$ such that the weight matrices $\\mathbf{W}^{k}$ and $\\mathbf{W}^{v}$ are significantly smaller. \r\n\r\nNext, we compute a weighted average output $\\mathbf{A} \\in \\mathbb{R}^{d^{\\prime} \\times L}$ using [scaled dot-product attention](https://paperswithcode.com/method/scaled):\r\n\r\n$$\r\n\\mathbf{A}^{\\top}=\\operatorname{softmax}\\left(\\frac{\\mathbf{Q}^{\\top} \\mathbf{K}}{\\sqrt{d^{\\prime}}}\\right) \\mathbf{V}^{\\top}\r\n$$\r\n\r\nThe final output $U$ required by the elementwise recurrence is obtained by another linear projection,\r\n\r\n$$\r\n\\mathbf{U}^{\\top}=\\mathbf{W}^{o}(\\mathbf{Q}+\\alpha \\cdot \\mathbf{A})\r\n$$\r\n\r\nwhere $\\alpha \\in \\mathbb{R}$ is a learned scalar and $\\mathbf{W}\\_{o} \\in \\mathbb{R}^{3 d \\times d^{\\prime}}$ is a parameter matrix. 
$\\mathbf{Q}+\\alpha \\cdot \\mathbf{A}$ is a [residual connection](https://paperswithcode.com/method/residual-connection) which improves gradient propagation and stabilizes training. We initialize $\\alpha$ to zero and as a result,\r\n\r\n$$\r\n\\mathbf{U}^{\\top}=\\mathbf{W}^{o} \\mathbf{Q}=\\left(\\mathbf{W}^{o} \\mathbf{W}^{q}\\right) \\mathbf{X}^{\\top}\r\n$$\r\n\r\ninitially falls back to a linear transformation of the input $X$ skipping the attention transformation. Intuitively, skipping attention encourages leveraging recurrence to capture sequential patterns during early stage of training. As $|\\alpha|$ grows, the attention mechanism can learn long-range dependencies for the model. In addition, $\\mathbf{W}^{o} \\mathbf{W}^{q}$ can be interpreted as applying a matrix factorization trick with a small inner dimension $d^{\\prime}<d$, reducing the total number of parameters. The Figure compares the differences of SRU, SRU with this factorization trick (but without attention), and SRU++.\r\n\r\nThe last modification is adding [layer normalization](https://paperswithcode.com/method/layer-normalization) to each SRU++ layer. We apply normalization after the attention operation and before the matrix multiplication with $\\mathbf{W}^{o}$\r\n\r\n$$\r\n\\mathbf{U}^{\\top}=\\mathbf{W}^{o} \\operatorname{layernorm}(\\mathbf{Q}+\\alpha \\cdot \\mathbf{A})\r\n$$\r\n\r\nThis implementation is post-layer normalization in which the normalization is added after the residual connection.",
  "title": "When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Gradient Clipping",
  "full_name": "Gradient Clipping",
  "description": "One difficulty that arises with optimization of deep neural networks is that large parameter gradients can lead an [SGD](https://paperswithcode.com/method/sgd) optimizer to update the parameters strongly into a region where the loss function is much greater, effectively undoing much of the work that was needed to get to the current solution.\r\n\r\n**Gradient Clipping** clips the size of the gradients to ensure optimization performs more reasonably near sharp areas of the loss surface. It can be performed in a number of ways. One option is to simply clip the parameter gradient element-wise before a parameter update. Another option is to clip the norm ||$\\textbf{g}$|| of the gradient $\\textbf{g}$ before a parameter update:\r\n\r\n$$\\text{ if } ||\\textbf{g}||  > v \\text{ then } \\textbf{g} \\leftarrow \\frac{\\textbf{g}{v}}{||\\textbf{g}||}$$\r\n\r\nwhere $v$ is a norm threshold.\r\n\r\nSource: Deep Learning, Goodfellow et al\r\n\r\nImage Source: [Pascanu et al](https://arxiv.org/pdf/1211.5063.pdf)",
  "title": null,
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "SPS",
  "full_name": "Semi-Pseudo-Label",
  "description": "",
  "title": "A Novel Neural Network Training Method for Autonomous Driving Using Semi-Pseudo-Labels and 3D Data Augmentations",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "TSDAE",
  "full_name": "TSDAE",
  "description": "**TSDAE** is an unsupervised sentence embedding method. During training, TSDAE encodes corrupted sentences into fixed-sized vectors and requires the decoder to reconstruct the original sentences from this sentence embedding. For good reconstruction quality, the semantics must be captured well in the sentence embedding from the encoder. Later, at inference, we only use the encoder for creating sentence embeddings.\r\n\r\nThe model architecture of TSDAE is a modified [encoder-decoder Transformer](https://paperswithcode.com/methods/category/autoencoding-transformers) where the key and value of the cross-attention are both confined to the sentence embedding only. Formally, the formulation of the modified cross-attention is:\r\n\r\n$$\r\nH^{(k)}=\\text { Attention }\\left(H^{(k-1)},\\left[s^{T}\\right],\\left[s^{T}\\right]\\right)\r\n$$\r\n\r\n$$\r\n\\operatorname{Attention}(Q, K, V)=\\operatorname{softmax}\\left(\\frac{Q K^{T}}{\\sqrt{d}}\\right) V\r\n$$\r\n\r\nwhere $H^{(k)} \\in \\mathbb{R}^{t \\times d}$ is the decoder hidden states within $t$ decoding steps at the $k$-th layer, $d$ is the size of the sentence embedding, $\\left[s^{T}\\right] \\in \\mathbb{R}^{1 \\times d}$ is a one-row matrix including the sentence embedding vector and $Q, K$ and $V$ are the query, key and value, respectively. By exploring different configurations on the STS benchmark dataset, the authors discover that the best combination is: (1) adopting deletion as the input noise and setting the deletion ratio to $0.6,(2)$ using the output of the [CLS] token as fixed-sized sentence representation (3) tying the encoder and decoder parameters during training.",
  "title": "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning",
  "collection": "Sentence Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "Lecun's Tanh",
  "full_name": "Lecun's Tanh",
  "description": "**LeCun's tanh** is an activation function of the form $f\\left(x\\right) = 1.7159\\tanh\\left(\\frac{2}{3}x\\right)$. The constants were chosen to keep the variance of the output close to 1.",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "SAC",
  "full_name": "Switchable Atrous Convolution",
  "description": "**Switchable Atrous Convolution (SAC)** softly switches the convolutional computation between different atrous rates and gathers the results using switch functions. The switch functions are spatially dependent, i.e., each location of the feature map might have different switches to control the outputs of SAC. To use SAC in a detector, we convert all the standard 3x3 convolutional layers in the bottom-up backbone to SAC.",
  "title": "DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "1-bit Adam",
  "full_name": "1-bit Adam",
  "description": "**1-bit Adam** is a [stochastic optimization](https://paperswithcode.com/methods/category/stochastic-optimization) technique that is a variant of [ADAM](https://paperswithcode.com/method/adam) with error-compensated 1-bit compression, based on finding that Adam's variance term becomes stable at an early stage. First vanilla Adam is used for a few epochs as a warm-up. After the warm-up stage, the compression stage starts and we stop updating the variance term $\\mathbf{v}$ and use it as a fixed precondition. At the compression stage, we communicate based on the momentum applied with error-compensated 1-bit compression. The momentums are quantized into 1-bit representation (the sign of each element). Accompanying the vector, a scaling factor is computed as $\\frac{\\text { magnitude of compensated gradient }}{\\text { magnitude of quantized gradient }}$. This scaling factor ensures that the compressed momentum has the same magnitude as the uncompressed momentum. This 1-bit compression could reduce the communication cost by $97 \\%$ and $94 \\%$ compared to the original float 32 and float 16 training, respectively.",
  "title": "1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Boost-GNN",
  "full_name": "Boost-GNN",
  "description": "**Boost-GNN** is an architecture that trains GBDT and GNN jointly to get the best of both worlds: the GBDT model deals with heterogeneous features, while GNN accounts for the graph structure. The model benefits from end-to-end optimization by allowing new trees to fit the gradient updates of GNN.",
  "title": "Boost then Convolve: Gradient Boosting Meets Graph Neural Networks",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "RMSNorm",
  "full_name": "Root Mean Square Layer Normalization",
  "description": "RMSNorm regularizes the summed inputs to a neuron in one layer according to root mean square (RMS), giving the model re-scaling invariance property and implicit learning rate adaptation ability. RMSNorm is computationally simpler and thus more efficient than LayerNorm.",
  "title": "Root Mean Square Layer Normalization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Noisy Linear Layer",
  "full_name": "Noisy Linear Layer",
  "description": "A **Noisy Linear Layer** is a [linear layer](https://paperswithcode.com/method/linear-layer) with parametric noise added to the weights. This induced stochasticity can be used in reinforcement learning networks for the agent's policy to aid efficient exploration. The parameters of the noise are learned with gradient descent along with any other remaining network weights. Factorized Gaussian noise is the type of noise usually employed.\r\n\r\nThe noisy linear layer takes the form:\r\n\r\n$$y = \\left(b + Wx\\right) + \\left(b\\_{noisy}\\odot\\epsilon^{b}+\\left(W\\_{noisy}\\odot\\epsilon^{w}\\right)x\\right) $$\r\n\r\nwhere $\\epsilon^{b}$ and $\\epsilon^{w}$ are random variables.",
  "title": "Noisy Networks for Exploration",
  "collection": "Randomized Value Functions",
  "area": "Reinforcement Learning"
}
{
  "name": "GAN Least Squares Loss",
  "full_name": "GAN Least Squares Loss",
  "description": "**GAN Least Squares Loss** is a least squares loss function for generative adversarial networks. Minimizing this objective function is equivalent to minimizing the Pearson $\\chi^{2}$ divergence. The objective function (here for [LSGAN](https://paperswithcode.com/method/lsgan)) can be defined as:\r\n\r\n$$ \\min\\_{D}V\\_{LS}\\left(D\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{x} \\sim p\\_{data}\\left(\\mathbf{x}\\right)}\\left[\\left(D\\left(\\mathbf{x}\\right) - b\\right)^{2}\\right] + \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z}\\sim p\\_{data}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - a\\right)^{2}\\right] $$\r\n\r\n$$ \\min\\_{G}V\\_{LS}\\left(G\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z} \\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - c\\right)^{2}\\right] $$\r\n\r\nwhere $a$ and $b$ are the labels for fake data and real data and $c$ denotes the value that $G$ wants $D$ to believe for fake data.",
  "title": "Least Squares Generative Adversarial Networks",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "OPT",
  "full_name": "OPT",
  "description": "**OPT** is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. The model uses an AdamW optimizer and weight decay of 0.1. It follows a linear learning rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps in OPT-175B, or over 375M tokens in the smaller models, and decaying down to 10% of the maximum LR over 300B tokens. The batch sizes range from 0.5M to 4M depending on the model size and is kept constant throughout the course of training.",
  "title": "OPT: Open Pre-trained Transformer Language Models",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Annealing SNNL",
  "full_name": "Soft Nearest Neighbor Loss with Annealing Temperature",
  "description": "",
  "title": "Improving k-Means Clustering Performance with Disentangled Internal Representations",
  "collection": "Image Representations",
  "area": "Computer Vision"
}
{
  "name": "APPO",
  "full_name": "Asynchronous Proximal Policy Optimization",
  "description": "",
  "title": "Sample Factory: Egocentric 3D Control from Pixels at 100000 FPS with Asynchronous Reinforcement Learning",
  "collection": "Distributed Reinforcement Learning",
  "area": "Reinforcement Learning"
}
{
  "name": "mBARTHez",
  "full_name": "mBARTHez",
  "description": "**BARThez** is a self-supervised transfer learning model for the French language based on [BART](https://paperswithcode.com/method/bart). Compared to existing [BERT](https://paperswithcode.com/method/bert)-based French language models such as [CamemBERT](https://paperswithcode.com/paper/camembert-a-tasty-french-language-model) and [FlauBERT](https://paperswithcode.com/paper/flaubert-unsupervised-language-model-pre), BARThez is well-suited for generative tasks, since not only its encoder but also its decoder is pretrained.",
  "title": "BARThez: a Skilled Pretrained French Sequence-to-Sequence Model",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "EMF",
  "full_name": "Enhanced-Multimodal Fuzzy Framework",
  "description": "BCI MI framework to classifiy brain signals using a multimodal decission making phase, with an addtional differentiation of the signal.",
  "title": "Motor-Imagery-Based Brain Computer Interface using Signal Derivation and Aggregation Functions",
  "collection": "Time Series Analysis",
  "area": "Sequential"
}
{
  "name": "Seesaw Loss",
  "full_name": "Seesaw Loss",
  "description": "**Seesaw Loss** is a loss function for long-tailed instance segmentation. It dynamically re-balances the gradients of positive and negative samples on a tail class with two complementary factors: mitigation factor and compensation factor. The mitigation factor reduces punishments to tail categories w.r.t the ratio of cumulative training instances between different categories. Meanwhile, the compensation factor increases the penalty of misclassified instances to avoid false positives of tail categories. The synergy of the two factors enables Seesaw Loss to mitigate the overwhelming punishments on tail classes as well as compensate for the risk of misclassification caused by diminished penalties.\r\n\r\n$$ L\\_{seesaw}\\left(\\mathbf{x}\\right) = - \\sum^{C}\\_{i=1}y\\_{i}\\log\\left(\\hat{\\sigma}\\_{i}\\right) $$\r\n\r\n$$ \\text{with } \\hat{\\sigma\\_{i}} = \\frac{e^{z\\_{i}}}{- \\sum^{C}\\_{j\\neq{1}}\\mathcal{S}\\_{ij}e^{z\\_{j}}+e^{z\\_{i}} } $$\r\n\r\nHere $\\mathcal{S}\\_{ij}$ works as a tunable balancing factor between different classes. By a careful design of $\\mathcal{S}\\_{ij}$, Seesaw loss adjusts the punishments on class j from positive samples of class $i$. Seesaw loss determines $\\mathcal{S}\\_{ij}$ by a mitigation factor and a compensation factor, as:\r\n\r\n$$ \\mathcal{S}\\_{ij} =\\mathcal{M}\\_{ij} · \\mathcal{C}\\_{ij}  $$\r\n\r\nThe mitigation factor $\\mathcal{M}\\_{ij}$ decreases the penalty on tail class $j$ according to a ratio of instance numbers between tail class $j$ and head class $i$. The compensation factor $\\mathcal{C}\\_{ij}$ increases the penalty on class $j$ whenever an instance of class $i$ is misclassified to class $j$.",
  "title": "Seesaw Loss for Long-Tailed Instance Segmentation",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Orthogonal Regularization",
  "full_name": "Orthogonal Regularization",
  "description": "**Orthogonal Regularization** is a regularization technique for convolutional neural networks, introduced with generative modelling as the task in mind. Orthogonality is argued to be a desirable quality in ConvNet filters, partially because multiplication by an orthogonal matrix leaves the norm of the original matrix unchanged. This property is valuable in deep or recurrent networks, where repeated matrix multiplication can result in signals vanishing or exploding. To try to maintain orthogonality throughout training, Orthogonal Regularization encourages weights to be orthogonal by pushing them towards the nearest orthogonal manifold. The objective function is augmented with the cost:\r\n\r\n$$ \\mathcal{L}\\_{ortho} = \\sum\\left(|WW^{T} − I|\\right) $$\r\n\r\nWhere $\\sum$ indicates a sum across all filter banks, $W$ is a filter bank, and $I$ is the identity matrix",
  "title": "Neural Photo Editing with Introspective Adversarial Networks",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "WaveNet",
  "full_name": "WaveNet",
  "description": "**WaveNet** is an audio generative model based on the [PixelCNN](https://paperswithcode.com/method/pixelcnn) architecture. In order to deal with long-range temporal dependencies needed for raw audio generation, architectures are developed based on dilated causal convolutions, which exhibit very large receptive fields.\r\n\r\nThe joint probability of a waveform $\\vec{x} = \\{ x_1, \\dots, x_T \\}$ is factorised as a product of conditional probabilities as follows:\r\n\r\n$$p\\left(\\vec{x}\\right) = \\prod_{t=1}^{T} p\\left(x_t \\mid x_1, \\dots ,x_{t-1}\\right)$$\r\n\r\nEach audio sample $x_t$ is therefore conditioned on the samples at all previous timesteps.",
  "title": "WaveNet: A Generative Model for Raw Audio",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "ProCAN",
  "full_name": "Progressive Growing Channel Attentive Non-Local Network",
  "description": "Lung cancer classification in screening computed tomography (CT) scans is one of the most crucial tasks for early detection of this disease. Many lives can be saved if we are able to accurately classify malignant/cancerous lung nodules. Consequently, several deep learning based models have been proposed recently to classify lung nodules as malignant or benign. Nevertheless, the large variation in the size and heterogeneous appearance of the nodules makes this task an extremely challenging one. We propose a new Progressive Growing Channel Attentive Non-Local (ProCAN) network for lung nodule classification. The proposed method addresses this challenge from three different aspects. First, we enrich the Non-Local network by adding channel-wise attention capability to it. Second, we apply Curriculum Learning principles, whereby we first train our model on easy examples before hard ones. Third, as the classification task gets harder during the Curriculum learning, our model is progressively grown to increase its capability of handling the task at hand. We examined our proposed method on two different public datasets and compared its performance with state-of-the-art methods in the literature. The results show that the ProCAN model outperforms state-of-the-art methods and achieves an AUC of 98.05% and an accuracy of 95.28% on the LIDC-IDRI dataset. Moreover, we conducted extensive ablation studies to analyze the contribution and effects of each new component of our proposed method.",
  "title": "ProCAN: Progressive Growing Channel Attentive Non-Local Network for Lung Nodule Classification",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ReGLU",
  "full_name": "ReGLU",
  "description": "**ReGLU** is an activation function which is a variant of [GLU](https://paperswithcode.com/method/glu). The definition is as follows:\r\n\r\n$$ \\text{ReGLU}\\left(x, W, V, b, c\\right) = \\max\\left(0, xW + b\\right) \\otimes \\left(xV + c\\right) $$",
  "title": "GLU Variants Improve Transformer",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "SRDC",
  "full_name": "Structurally Regularized Deep Clustering",
  "description": "**Structurally Regularized Deep Clustering**, or **SRDC**, is a deep network based discriminative clustering method for domain adaptation that minimizes the KL divergence between predictive label distribution of the network and an introduced auxiliary one. Replacing the auxiliary distribution with that formed by ground-truth labels of source data implements the structural source regularization via a simple strategy of joint network training.",
  "title": "Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "Twins-SVT",
  "full_name": "Twins-SVT",
  "description": "**Twins-SVT** is a type of [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) which utilizes a [spatially separable attention mechanism](https://paperswithcode.com/method/spatially-separable-self-attention) (SSAM) which is composed of two types of attention operations—(i) locally-grouped self-attention (LSA), and (ii) global sub-sampled attention (GSA), where LSA captures the fine-grained and short-distance information and GSA deals with the long-distance and global information. On top of this, it utilizes [conditional position encodings](https://paperswithcode.com/method/conditional-positional-encoding) as well as the architectural design of the [Pyramid Vision Transformer](https://paperswithcode.com/method/pvt).",
  "title": "Twins: Revisiting the Design of Spatial Attention in Vision Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "MiVOS",
  "full_name": "Modular Interactive VOS",
  "description": "**MiVOS** is a video object segmentation model which decouples interaction-to-mask and mask propagation. By decoupling interaction from propagation, MiVOS is versatile and not limited by the type of interactions. It uses three modules: Interaction-to-Mask, Propagation and Difference-Aware Fusion. Trained separately, the interaction module converts user interactions to an object mask, which is then temporally propagated by our propagation module using a novel top-filtering strategy in reading the space-time memory. To effectively take the user's intent into account, a novel difference-aware module is proposed to learn how to properly fuse the masks before and after each interaction, which are aligned with the target frames by employing the space-time memory.",
  "title": "Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion",
  "collection": "Video Object Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Voxel R-CNN",
  "full_name": "Voxel R-CNN",
  "description": "**Voxel R-CNN** is a voxel-based two stage framework for 3D object detection. It consists of a 3D backbone network, a 2D bird-eye-view (BEV) Region Proposal Network and a detect head. Voxel RoI Pooling is devised to extract RoI features directly from raw features for further refinement. \r\n\r\nEnd-to-end, the point clouds are first divided into regular voxels and fed into the 3D backbone network for feature extraction. Then, the 3D feature volumes are converted into BEV representation, on which the 2D backbone and [RPN](https://paperswithcode.com/method/rpn) are applied for region proposal generation. Subsequently, [Voxel RoI Pooling](https://paperswithcode.com/method/voxel-roi-pooling) directly extracts RoI features from the 3D feature volumes. Finally the RoI features are exploited in the detect head for further box refinement.",
  "title": "Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection",
  "collection": "3D Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "SyCoCa",
  "full_name": "Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment",
  "description": "Multimodal alignment between language and vision is the fundamental topic in current vision-language model research. Contrastive Captioners (CoCa), as a representative method, integrates Contrastive Language-Image Pretraining (CLIP) and Image Caption (IC) into a unified framework, resulting in impressive results. CLIP imposes a bidirectional constraints on global representation of entire images and sentences. Although IC conducts an unidirectional image-to-text generation on local representation, it lacks any constraint on local text-to-image reconstruction, which limits the ability to understand images at a fine-grained level when aligned with texts. To achieve multimodal alignment from both global and local perspectives, this paper proposes Symmetrizing Contrastive Captioners (SyCoCa), which introduces bidirectional interactions on images and texts across the global and local representation levels. Specifically, we expand a Text-Guided Masked Image Modeling (TG-MIM) head based on ITC and IC heads. The improved SyCoCa can further leverage textual cues to reconstruct contextual images and visual cues to predict textual contents. When implementing bidirectional local interactions, the local contents of images tend to be cluttered or unrelated to their textual descriptions. Thus, we employ an attentive masking strategy to select effective image patches for interaction. Extensive experiments on five vision-language tasks, including image-text retrieval, image-captioning, visual question answering, and zero-shot/finetuned image classification, validate the effectiveness of our proposed method.",
  "title": "SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment",
  "collection": "Multi-Modal Methods",
  "area": "Computer Vision"
}
{
  "name": "PipeTransformer",
  "full_name": "PipeTransformer",
  "description": "**PipeTransformer** is a method for automated elastic pipelining for efficient distributed training of [Transformer](https://paperswithcode.com/method/transformer) models. In PipeTransformer, an adaptive on the fly freeze algorithm is used that can identify and freeze some layers gradually during training, as well as an elastic pipelining system that can dynamically allocate resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width.",
  "title": "PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "Weight excitation",
  "full_name": "Weight excitation",
  "description": "A novel built-in attention mechanism, that is complementary to all other prior attention mechanisms (e.g. squeeze and excitation, transformers) that are external (i.e., not built-in - please read paper for more details)",
  "title": "Weight Excitation: Built-in Attention Mechanisms in Convolutional Neural Networks",
  "collection": "Attention",
  "area": "General"
}
{
  "name": "MushroomRL",
  "full_name": "MushroomRL",
  "description": "**MushroomRL** is an open-source Python library developed to simplify the process of implementing and running Reinforcement Learning (RL) experiments. The architecture of MushroomRL is built in such a way that every component of an RL problem is already provided, and most of the time users can only focus on the implementation of their own algorithms and experiments. MushroomRL comes with a strongly modular architecture that makes it easy to understand how each component is structured and how it interacts with other ones; moreover it provides an exhaustive list of RL methodologies, such as:",
  "title": "MushroomRL: Simplifying Reinforcement Learning Research",
  "collection": "Reinforcement Learning Frameworks",
  "area": "Reinforcement Learning"
}
{
  "name": "XLM",
  "full_name": "XLM",
  "description": "**XLM** is a [Transformer](https://paperswithcode.com/method/transformer) based architecture that is pre-trained using one of three language modelling objectives:\r\n\r\n1. Causal Language Modeling - models the probability of a word given the previous words in a sentence.\r\n2. Masked Language Modeling - the masked language modeling objective of [BERT](https://paperswithcode.com/method/bert).\r\n3. Translation Language Modeling - a (new) translation language modeling objective for improving cross-lingual pre-training.\r\n\r\nThe authors find that both the CLM and MLM approaches provide strong cross-lingual features that can be used for pretraining models.",
  "title": "Cross-lingual Language Model Pretraining",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Library",
  "full_name": "Lib",
  "description": "Please enter a description about the method here",
  "title": "0/1 Deep Neural Networks via Block Coordinate Descent",
  "collection": null,
  "area": null
}
{
  "name": "DBlock",
  "full_name": "DBlock",
  "description": "**DBlock** is a residual based block used in the discriminator of the [GAN-TTS](https://paperswithcode.com/method/gan-tts) architecture. They are similar to the [GBlocks](https://paperswithcode.com/method/gblock) used in the generator, but without batch normalisation.",
  "title": "High Fidelity Speech Synthesis with Adversarial Networks",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "DeiT",
  "full_name": "Data-efficient Image Transformer",
  "description": "A **Data-Efficient Image Transformer** is a type of [Vision Transformer](https://paperswithcode.com/method/vision-transformer) for image classification tasks. The model is trained using a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention.",
  "title": "Training data-efficient image transformers & distillation through attention",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "TAM",
  "full_name": "Temporal Adaptive Module",
  "description": "TAM is designed to capture complex temporal relationships both  efficiently and  flexibly,\r\nIt adopts an adaptive kernel instead of self-attention to capture  global contextual information, with lower time complexity \r\nthan GLTR.\r\n\r\nTAM has two branches, a local branch and a global branch. Given the input feature map $X\\in \\mathbb{R}^{C\\times T\\times H\\times W}$,  global spatial average pooling $\\text{GAP}$ is first applied to the feature map to ensure TAM has a low computational cost. Then the local branch in TAM employs several 1D convolutions with ReLU nonlinearity across the temporal domain to produce location-sensitive importance maps for enhancing frame-wise features.\r\nThe local branch can be written as\r\n\\begin{align}\r\n    s &= \\sigma(\\text{Conv1D}(\\delta(\\text{Conv1D}(\\text{GAP}(X)))))\r\n\\end{align}\r\n\\begin{align}\r\n    X^1 &= s X\r\n\\end{align}\r\nUnlike the local branch, the global branch is location invariant and focuses on generating a channel-wise adaptive kernel based on global temporal information in each channel. For the $c$-th channel, the  kernel can be written as\r\n\r\n\\begin{align}\r\n    \\Theta_c = \\text{Softmax}(\\text{FC}_2(\\delta(\\text{FC}_1(\\text{GAP}(X)_c)))) \r\n\\end{align}\r\n\r\nwhere $\\Theta_c \\in \\mathbb{R}^{K}$ and $K$ is the adaptive kernel size. Finally, TAM  convolves the adaptive kernel $\\Theta$ with $ X_\\text{out}^1$:\r\n\\begin{align}\r\n    Y = \\Theta \\otimes  X^1\r\n\\end{align}\r\n\r\nWith the help of the local branch and global branch,\r\nTAM can capture the complex temporal structures in video and \r\nenhance per-frame features at low computational cost.\r\nDue to its flexibility and lightweight design,\r\nTAM can be added to any existing 2D CNNs.",
  "title": "TAM: Temporal Adaptive Module for Video Recognition",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ELMo",
  "full_name": "ELMo",
  "description": "**Embeddings from Language Models**, or **ELMo**, is a type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.\r\n\r\nA biLM combines both a forward and backward LM.  ELMo jointly maximizes the log likelihood of the forward and backward directions. To add ELMo to a supervised model, we freeze the weights of the biLM and then concatenate the ELMo vector $\\textbf{ELMO}^{task}_k$ with $\\textbf{x}_k$ and pass the ELMO enhanced representation $[\\textbf{x}_k; \\textbf{ELMO}^{task}_k]$ into the task RNN. Here $\\textbf{x}_k$ is a context-independent token representation for each token position. \r\n\r\nImage Source: [here](https://medium.com/@duyanhnguyen_38925/create-a-strong-text-classification-with-the-help-from-elmo-e90809ba29da)",
  "title": "Deep contextualized word representations",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "IGSA",
  "full_name": "Improved Gravitational Search algorithm",
  "description": "Metaheuristic algorithm",
  "title": "GSANet: Semantic Segmentation with Global and Selective Attention",
  "collection": "Heuristic Search Algorithms",
  "area": "Reinforcement Learning"
}
{
  "name": "Hamburger",
  "full_name": "Hamburger",
  "description": "**Hamburger** is a global context module that employs matrix decomposition to factorize the learned representation into sub-matrices so as to recover the clean low-rank signal subspace. The key idea is, if we formulate the inductive bias like the global context into an objective function, the optimization algorithm to minimize the objective function can construct a computational graph, i.e., the architecture we need in the networks.",
  "title": "Is Attention Better Than Matrix Decomposition?",
  "collection": "Image Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "CS-GAN",
  "full_name": "CS-GAN",
  "description": "**CS-GAN** is a type of generative adversarial network that uses a form of deep compressed sensing, and [latent optimisation](https://paperswithcode.com/method/latent-optimisation), to improve the quality of generated samples.",
  "title": "Deep Compressed Sensing",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "TuckER-RP",
  "full_name": "TuckER with Relation Prediction",
  "description": "TuckER model trained with a relation prediction objective on top of the 1vsAll loss",
  "title": "Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "DifferNet",
  "full_name": "DifferNet",
  "description": "",
  "title": "Same Same But DifferNet: Semi-Supervised Defect Detection with Normalizing Flows",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "RDF2Vec",
  "full_name": "RDF2Vec",
  "description": "",
  "title": "RDF2Vec: RDF Graph Embeddings and Their Applications",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Inception-B",
  "full_name": "Inception-B",
  "description": "**Inception-B** is an image model block used in the [Inception-v4](https://paperswithcode.com/method/inception-v4) architecture.",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Spatial Propagation",
  "full_name": "Surface Nomral-based Spatial Propagation",
  "description": "Inspired by the spatial propagation mechanism utilized in the depth completion task \\cite{NLSPN}, we introduce a normal incorporated non-local disparity propagation module in which we hub NDP to generate non-local affinities and offsets for spatial propagation at the disparity level. The motivation lies that the sampled pixels for edges and occluded regions are supposed to be selected. The propagation process aggregates disparities via plane affinity relations, which alleviates the phenomenon of disparity blurring at object edges due to frontal parallel windows. And the disparities in occluded areas are also optimized at the same time by being propagated from non-occluded areas where the predicted disparities are with high confidence.",
  "title": "Digging Into Normal Incorporated Stereo Matching",
  "collection": "Stereo Depth Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "GAIL",
  "full_name": "Generative Adversarial Imitation Learning",
  "description": "**Generative Adversarial Imitation Learning** presents a new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning.",
  "title": "Generative Adversarial Imitation Learning",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "Visformer",
  "full_name": "Visformer",
  "description": "**Visformer**, or **Vision-friendly Transformer**, is an architecture that combines [Transformer](https://paperswithcode.com/methods/category/transformers)-based architectural features with those from [convolutional neural network](https://paperswithcode.com/methods/category/convolutional-neural-networks) architectures. Visformer adopts the stage-wise design for higher base performance. But [self-attentions](https://paperswithcode.com/method/multi-head-attention) are only utilized in the last two stages, considering that self-attention in the high-resolution stage is relatively inefficient even when the FLOPs are balanced. Visformer employs [bottleneck blocks](https://paperswithcode.com/method/bottleneck-residual-block) in the first stage and utilizes [group 3 × 3 convolutions](https://paperswithcode.com/method/grouped-convolution) in bottleneck blocks inspired by [ResNeXt](https://paperswithcode.com/method/resnext). It also introduces [BatchNorm](https://paperswithcode.com/method/batch-normalization) to patch embedding modules as in CNNs.",
  "title": "Visformer: The Vision-friendly Transformer",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "SpineNet",
  "full_name": "SpineNet",
  "description": "**SpineNet** is a convolutional neural network backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by [Neural Architecture Search](https://paperswithcode.com/method/neural-architecture-search).",
  "title": "SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Feedback Memory",
  "full_name": "Feedback Memory",
  "description": "**Feedback Memory** is a type of attention module used in the [Feedback Transformer](https://paperswithcode.com/method/feedback-transformer) architecture. It allows a [transformer](https://paperswithcode.com/method/transformer) to to use the most abstract representations from the past directly as inputs for the current timestep. This means that the model does not form its representation in parallel, but sequentially token by token. More precisely, we replace the context inputs to attention modules with memory vectors that are computed over the past, i.e.:\r\n\r\n$$ \\mathbf{z}^{l}\\_{t} = \\text{Attn}\\left(\\mathbf{x}^{l}\\_{t},  \\left[\\mathbf{m}\\_{t−\\tau}, \\dots, \\mathbf{m}\\_{t−1}\\right]\\right) $$\r\n\r\nwhere a memory vector $\\mathbf{m}\\_{t}$ is computed by summing the representations of each layer at the $t$-th time step:\r\n\r\n$$ \\mathbf{m}\\_{t} = \\sum^{L}\\_{l=0}\\text{Softmax}\\left(w^{l}\\right)\\mathbf{x}\\_{t}^{l} $$\r\n\r\nwhere $w^{l}$ are learnable scalar parameters. Here $l = 0$ corresponds to token embeddings. The weighting of different layers by a [softmax](https://paperswithcode.com/method/softmax) output gives the model more flexibility as it can average them or select one of them. This modification of the self-attention input adapts the computation of the Transformer from parallel to sequential, summarized in the Figure. Indeed, it gives the ability to formulate the representation $\\mathbf{x}^{l}\\_{t+1}$ based on past representations from any layer $l'$, while in a standard Transformer this is only true for $l > l'$. This change can be viewed as exposing all previous computations to all future computations, providing better representations of the input. Such capacity would allow much shallower models to capture the same level of abstraction as a deeper architecture.",
  "title": "Addressing Some Limitations of Transformers with Feedback Memory",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "Content-based Attention",
  "full_name": "Content-based Attention",
  "description": "**Content-based attention** is an attention mechanism based on cosine similarity:\r\n\r\n$$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = \\cos\\left[\\textbf{h}\\_{i};\\textbf{s}\\_{j}\\right] $$\r\n\r\nIt was utilised in [Neural Turing Machines](https://paperswithcode.com/method/neural-turing-machine) as part of the Addressing Mechanism.\r\n\r\nWe produce a normalized attention weighting by taking a [softmax](https://paperswithcode.com/method/softmax) over these attention alignment scores.",
  "title": "Neural Turing Machines",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ShakeDrop",
  "full_name": "ShakeDrop",
  "description": "**ShakeDrop regularization** extends [Shake-Shake regularization](https://paperswithcode.com/method/shake-shake-regularization) and can be applied not only to [ResNeXt](https://paperswithcode.com/method/resnext) but also [ResNet](https://paperswithcode.com/method/resnet), [WideResNet](https://paperswithcode.com/method/wideresnet), and [PyramidNet](https://paperswithcode.com/method/pyramidnet). The proposed ShakeDrop is given as\r\n\r\n$$G\\left(x\\right) = x + \\left(b\\_{l} + \\alpha − b\\_{l}\\alpha\\right)F\\left(x\\right), \\text{ in train-fwd} $$\r\n$$G\\left(x\\right) = x + \\left(b\\_{l} + \\beta − b\\_{l}\\beta\\right)F\\left(x\\right), \\text{ in train-bwd} $$\r\n$$G\\left(x\\right) = x + E\\left[b\\_{l} + \\alpha − b\\_{l}\\alpha\\right]F\\left(x\\right), \\text{ in test} $$\r\n\r\nwhere $b\\_{l}$ is a Bernoulli random variable with probability $P\\left(b\\_{l} = 1\\right) = E\\left[b\\_{l}\r\n\\right] = p\\_{l}$ given by the linear decay rule in each layer, and $\\alpha$ and $\\beta$ are independent uniform random variables in each element. \r\n\r\nThe most effective ranges of $\\alpha$ and $\\beta$ were experimentally found to be different from those of Shake-Shake, and are $\\alpha$ = 0, $\\beta \\in \\left[0, 1\\right]$ and $\\alpha \\in \\left[−1, 1\\right]$, $\\beta \\in \\left[0, 1\\right]$.",
  "title": "ShakeDrop Regularization for Deep Residual Learning",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Hermite Activation",
  "full_name": "Hermite Polynomial Activation",
  "description": "A **Hermite Activations** is a type of activation function which uses a smooth finite Hermite polynomial base as a substitute for non-smooth [ReLUs](https://paperswithcode.com/method/relu). \r\n\r\nRelevant Paper: [Lokhande et al](https://arxiv.org/pdf/1909.05479.pdf)",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Linear Warmup With Cosine Annealing",
  "full_name": "Linear Warmup With Cosine Annealing",
  "description": "**Linear Warmup With Cosine Annealing** is a learning rate schedule where we increase the learning rate linearly for $n$ updates and then anneal according to a cosine schedule afterwards.",
  "title": null,
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "DAGNN",
  "full_name": "Directed Acyclic Graph Neural Network",
  "description": "A GNN for dags, which injects their topological order as an inductive bias via asynchronous message passing.",
  "title": "Directed Acyclic Graph Neural Networks",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "SERLU",
  "full_name": "SERLU",
  "description": "**SERLU**, or **Scaled Exponentially-Regularized Linear Unit**, is a type of activation function. The new function introduces a bump-shaped function in the region of negative input. The bump-shaped function has approximately zero response to large negative input while being able to push the output of SERLU towards zero mean statistically.\r\n\r\n$$ \\text{SERLU}\\left(x\\right)) = \\lambda\\_{serlu}x \\text{ if } x \\geq 0 $$\r\n$$ \\text{SERLU}\\left(x\\right)) = \\lambda\\_{serlu}\\alpha\\_{serlu}xe^{x} \\text{ if } x < 0 $$\r\n\r\nwhere the two parameters $\\lambda\\_{serlu} > 0$ and $\\alpha\\_{serlu} > 0$ remain to be specified.",
  "title": "Effectiveness of Scaled Exponentially-Regularized Linear Units (SERLUs)",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "CSPResNeXt Block",
  "full_name": "CSPResNeXt Block",
  "description": "**CSPResNeXt Block** is an extended [ResNext Block](https://paperswithcode.com/method/resnext-block) where we partition the feature map of the base layer into two parts and then merge them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.",
  "title": "CSPNet: A New Backbone that can Enhance Learning Capability of CNN",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "LayerScale",
  "full_name": "LayerScale",
  "description": "**LayerScale** is a method used for [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) architectures to help improve training dynamics. It adds a learnable diagonal matrix on output of each residual block, initialized close to (but not at) 0. Adding this simple layer after each residual block improves the training dynamic, allowing for the training of deeper high-capacity image transformers that benefit from depth.\r\n\r\nSpecifically, LayerScale is a per-channel multiplication of the vector produced by each residual block, as opposed to a single scalar, see Figure (d). The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale is a multiplication by a diagonal matrix on output of each residual block. In other words:\r\n\r\n$$\r\nx\\_{l}^{\\prime} =x\\_{l}+\\operatorname{diag}\\left(\\lambda\\_{l, 1}, \\ldots, \\lambda\\_{l, d}\\right) \\times \\operatorname{SA}\\left(\\eta\\left(x\\_{l}\\right)\\right) \r\n$$\r\n\r\n$$\r\nx\\_{l+1} =x\\_{l}^{\\prime}+\\operatorname{diag}\\left(\\lambda\\_{l, 1}^{\\prime}, \\ldots, \\lambda\\_{l, d}^{\\prime}\\right) \\times \\operatorname{FFN}\\left(\\eta\\left(x\\_{l}^{\\prime}\\right)\\right)\r\n$$\r\n\r\nwhere the parameters $\\lambda\\_{l, i}$ and $\\lambda\\_{l, i}^{\\prime}$ are learnable weights. The diagonal values are all initialized to a fixed small value $\\varepsilon:$ we set it to $\\varepsilon=0.1$ until depth 18 , $\\varepsilon=10^{-5}$ for depth 24 and $\\varepsilon=10^{-6}$ for deeper networks. \r\n\r\nThis formula is akin to other [normalization](https://paperswithcode.com/methods/category/normalization) strategies [ActNorm](https://paperswithcode.com/method/activation-normalization) or [LayerNorm](https://paperswithcode.com/method/layer-normalization) but executed on output of the residual block. 
Yet LayerScale seeks a different effect: [ActNorm](https://paperswithcode.com/method/activation-normalization) is a data-dependent initialization that calibrates activations so that they have zero-mean and unit variance, like [BatchNorm](https://paperswithcode.com/method/batch-normalization). In contrast, in LayerScale, we initialize the diagonal with small values so that the initial contribution of the residual branches to the function implemented by the transformer is small. In that respect the motivation is therefore closer to that of [ReZero](https://paperswithcode.com/method/rezero), [SkipInit](https://paperswithcode.com/method/skipinit), [Fixup](https://paperswithcode.com/method/fixup-initialization) and [T-Fixup](https://paperswithcode.com/method/t-fixup): to train closer to the identity function and let the network integrate the additional parameters progressively during the training. LayerScale offers more diversity in the optimization than just adjusting the whole layer by a single learnable scalar as in [ReZero](https://paperswithcode.com/method/rezero)/[SkipInit](https://paperswithcode.com/method/skipinit), [Fixup](https://paperswithcode.com/method/fixup-initialization) and [T-Fixup](https://paperswithcode.com/method/t-fixup).",
  "title": "Going deeper with Image Transformers",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Twins-PCPVT",
  "full_name": "Twins-PCPVT",
  "description": "**Twins-PCPVT** is a type of [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) that combines global attention, specifically the global sub-sampled attention as proposed in [Pyramid Vision Transformer](https://paperswithcode.com/method/pvt), with [conditional position encodings](https://paperswithcode.com/method/conditional-positional-encoding) (CPE) to replace the [absolute position encodings](https://paperswithcode.com/method/absolute-position-encodings) used in PVT.\r\n\r\nThe [position encoding generator](https://paperswithcode.com/method/positional-encoding-generator) (PEG), which generates the CPE, is placed after the first encoder block of each stage. The simplest form of PEG is used, i.e., a 2D [depth-wise convolution](https://paperswithcode.com/method/depthwise-convolution) without [batch normalization](https://paperswithcode.com/method/batch-normalization). For image-level classification, following [CPVT](https://paperswithcode.com/method/cpvt), the class token is removed and [global average pooling](https://paperswithcode.com/method/global-average-pooling) is used at the end of the stage. For other vision tasks, the design of PVT is followed.",
  "title": "Twins: Revisiting the Design of Spatial Attention in Vision Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "FixMatch",
  "full_name": "FixMatch",
  "description": "FixMatch is an algorithm that first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images. For a given image, the pseudo-label is only retained if the model produces a high-confidence prediction. The model is then trained to predict the pseudo-label when fed a strongly-augmented version of the same image.\r\n\r\nDescription from: [FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence](https://paperswithcode.com/paper/fixmatch-simplifying-semi-supervised-learning)\r\n\r\nImage credit:  [FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence](https://paperswithcode.com/paper/fixmatch-simplifying-semi-supervised-learning)",
  "title": "FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "Bottom-up Path Augmentation",
  "full_name": "Bottom-up Path Augmentation",
  "description": "**Bottom-up Path Augmentation** is a feature extraction technique that seeks to shorten the information path and enhance a feature pyramid with accurate localization signals existing in low-levels. This is based on the fact that high response to edges or instance parts is a strong indicator to accurately localize instances. \r\n\r\nEach building block takes a higher resolution feature map $N\\_{i}$ and a coarser map $P\\_{i+1}$ through lateral connection and generates the new feature map $N\\_{i+1}$ Each feature map $N\\_{i}$ first goes through a $3 \\times 3$ convolutional layer with stride $2$ to reduce the spatial size. Then each element of feature map $P\\_{i+1}$ and the down-sampled map are added through lateral connection. The fused feature map is then processed by another $3 \\times 3$ convolutional layer to generate $N\\_{i+1}$ for following sub-networks. This is an iterative process and terminates after approaching $P\\_{5}$. In these building blocks, we consistently use channel 256 of feature maps. The feature grid for each proposal is then pooled from new feature maps, i.e., {$N\\_{2}$, $N\\_{3}$, $N\\_{4}$, $N\\_{5}$}.",
  "title": "Path Aggregation Network for Instance Segmentation",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "KIP",
  "full_name": "Kernel Inducing Points",
  "description": "**Kernel Inducing Points**, or **KIP**, is a meta-learning algorithm for learning datasets that can mitigate the challenges which occur for naturally occurring datasets without a significant sacrifice in performance. KIP uses kernel-ridge regression to learn \u000f$\\epsilon$-approximate datasets. It can be regarded as an adaption of the inducing point method for Gaussian processes to the case of Kernel Ridge Regression.",
  "title": "Dataset Meta-Learning from Kernel Ridge-Regression",
  "collection": "Meta-Learning Algorithms",
  "area": "General"
}
{
  "name": "LSDM",
  "full_name": "Language-driven Scene Synthesis using Multi-conditional Diffusion Model",
  "description": "Our main contribution is the Guiding Points Network, where we integrate all information from the conditions to generate guiding points. By applying transformation matrices to scene entities (human/objects) with attention weighting, we can forecast the spanning of the target object.",
  "title": null,
  "collection": "3D Representations",
  "area": "Computer Vision"
}
{
  "name": "GANformer",
  "full_name": "Generative Adversarial Transformer",
  "description": "GANformer is a novel and efficient type of [transformer](https://paperswithcode.com/method/transformer) which can be used for visual generative modeling. The network employs a bipartite structure that enables long-range interactions across an image, while maintaining computation of linearly efficiency, that can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes.\r\n\r\nSource: [Generative Adversarial Transformers](https://arxiv.org/pdf/2103.01209v2.pdf)\r\n\r\nImage source: [Generative Adversarial Transformers](https://arxiv.org/pdf/2103.01209v2.pdf)",
  "title": "Generative Adversarial Transformers",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "ReasonBERT",
  "full_name": "ReasonBERT",
  "description": "**ReasonBERT** is a pre-training method that augments language models with the ability to reason over long-range relations and multiple, possibly hybrid, contexts. It utilizes distant supervision to automatically connect multiple pieces of text and tables to create pre-training examples that require long-range reasoning. Different types of reasoning are simulated, including intersecting multiple pieces of evidence, bridging from one piece of evidence to another, and detecting unanswerable cases. \r\n\r\nSpecifically, given a query sentence containing an entity pair, if we mask one of the entities, another sentence or table that contains the same pair of entities can likely be used as evidence to recover the masked entity. Moreover, to encourage deeper reasoning, multiple pieces of evidence are collected that are jointly used to recover the masked entities in the query sentence, allowing for the scattering of the masked entities among different pieces of evidence to mimic different types of reasoning. \r\n\r\nThe Figure illustrates several examples using such distant supervision. In Ex. 1, a model needs to check multiple constraints (i.e., intersection reasoning type) and find “the beach soccer competition that is established in 1998.” In Ex. 2, a model needs to find “the type of the band that released Awaken the Guardian,” by first inferring the name of the band “Fates Warning” (i.e., bridging reasoning type). \r\n\r\nThe masked entities in a query sentence are replaced with the [QUESTION] tokens. The new pre-training objective, span reasoning, then extracts the masked entities from the provided evidence. Existing LMs like [BERT](https://paperswithcode.com/method/bert) and [RoBERTa](https://paperswithcode.com/method/roberta) are augmented by continuing to train them with the new objective, which leads to ReasonBERT. Then query sentence and textual evidence are encoded via the LM. 
When tabular evidence is present, the structure-aware [transformer](https://paperswithcode.com/method/transformer) [TAPAS](https://paperswithcode.com/method/tapas) is used as the encoder to capture the table structure.",
  "title": "ReasonBERT: Pre-trained to Reason with Distant Supervision",
  "collection": "Language Model Pre-Training",
  "area": "Natural Language Processing"
}
{
  "name": "CELU",
  "full_name": "Continuously Differentiable Exponential Linear Units",
  "description": "Exponential Linear Units (ELUs) are a useful rectifier for constructing deep learning architectures, as they may speed up and otherwise improve learning by virtue of not have vanishing gradients and by having mean activations near zero. However, the ELU activation as parametrized in [1] is not continuously differentiable with respect to its input when the shape parameter alpha is not equal to 1. We present an alternative parametrization which is C1 continuous for all values of alpha, making the rectifier easier to reason about and making alpha easier to tune. This alternative parametrization has several other useful properties that the original parametrization of ELU does not: 1) its derivative with respect to x is bounded, 2) it contains both the linear transfer function and ReLU as special cases, and 3) it is scale-similar with respect to alpha.\r\n$$\\text{CELU}(x) = \\max(0,x) + \\min(0, \\alpha * (\\exp(x/\\alpha) - 1))$$",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Class Activation Guided Attention Mechanism",
  "full_name": "Class Activation Guided Attention Mechanism (CAGAM)",
  "description": "CAGAM is a form of spatial attention mechanism that propagates attention from a known to an unknown context features thereby enhancing the unknown context for relevant pattern discovery. Usually the known context feature is a class activation map ([CAM](https://paperswithcode.com/method/cam)).",
  "title": "Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Wizard",
  "full_name": "Wizard: Unsupervised goats tracking algorithm",
  "description": "Computer vision is an interesting tool for animal behavior monitoring, mainly because it limits animal handling and it can be used to record various traits using only one sensor. From previous studies, this technic has shown to be suitable for various species and behavior. However it remains challenging to collect individual information, i.e. not only to detect animals and behavior on the video frames, but also to identify them. Animal identification is a prerequisite to gather individual information in order to characterize individuals and compare them. A common solution to this problem, known as multiple objects tracking, consists in detecting the animals on each video frame, and then associate detections to a unique animal ID. Association of detections between two consecutive frames are generally made to maintain coherence of the detection locations and appearances. To extract appearance information, a common solution is to use a convolutional neural network (CNN), trained on a large dataset before running the tracking algorithm. For farmed animals, designing such network is challenging as far as large training dataset are still lacking. In this article, we proposed an innovative solution, where the CNN used to extract appearance information is parameterized using offline unsupervised training. The algorithm, named Wizard, was evaluated for the purpose of goats monitoring in outdoor conditions. 17 annotated videos were used, for a total of 4H30, with various number of animals on the video (from 3 to 8) and different level of color differences between animals. First, the ability of the algorithm to track the detected animals was evaluated. When animals were detected, the algorithm found the correct animal ID in 94.82% of the frames. When tracking and detection were evaluated together, we found that Wizard found the correct animal ID in 86.18% of the video length. 
In situations where the animal detection rate could be high, Wizard seems to be a suitable solution for individual behavior analysis experiments based on computer vision.",
  "title": null,
  "collection": "Multi-Object Tracking Models",
  "area": "Computer Vision"
}
{
  "name": "LipGAN",
  "full_name": "LipGAN",
  "description": "**LipGAN** is a generative adversarial network for generating realistic talking faces conditioned on translated speech. It employs an adversary that measures the extent of lip synchronization in the frames generated by the generator. The system is capable of handling faces in random poses without the need for realignment to a template pose. LipGAN is a fully self-supervised approach that learns a phoneme-viseme mapping, making it language independent.",
  "title": "Towards Automatic Face-to-Face Translation",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "Implicit PointRend",
  "full_name": "Implicit PointRend",
  "description": "**Implicit PointRend** is a modification to the [PointRend](https://paperswithcode.com/method/pointrend) module for instance segmentation. Instead of a coarse mask prediction used in [PointRend](https://paperswithcode.com/method/pointrend) to provide region-level context to distinguish objects, for each object Implicit PointRend generates different parameters for a function that makes the final pointwise mask prediction. The new model is more straightforward than PointRend: (1) it does not require an importance point sampling during training and (2) it uses a single point-level mask loss instead of two mask losses. Implicit PointRend can be trained directly with point supervision without any intermediate prediction interpolation steps.",
  "title": "Pointly-Supervised Instance Segmentation",
  "collection": "Instance Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "UNETR",
  "full_name": "UNet Transformer",
  "description": "**UNETR**, or **UNet Transformer**, is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based architecture for [medical image segmentation](https://paperswithcode.com/task/medical-image-segmentation) that utilizes a pure [transformer](https://paperswithcode.com/method/transformer) as the encoder to learn sequence representations of the input volume -- effectively capturing the global multi-scale information. The transformer encoder is directly connected to a decoder via [skip connections](https://paperswithcode.com/methods/category/skip-connections) at different resolutions like a [U-Net](https://paperswithcode.com/method/u-net) to compute the final semantic segmentation output.",
  "title": "UNETR: Transformers for 3D Medical Image Segmentation",
  "collection": "Medical Image Models",
  "area": "Computer Vision"
}
{
  "name": "LARS",
  "full_name": "LARS",
  "description": "**Layer-wise Adaptive Rate Scaling**, or **LARS**, is a large batch optimization technique.  There are two notable differences between LARS and other adaptive algorithms such as [Adam](https://paperswithcode.com/method/adam) or [RMSProp](https://paperswithcode.com/method/rmsprop): first, LARS uses a separate learning rate for each layer and not for each weight. And second, the magnitude of the update is controlled with respect to the weight norm for better control of training speed.\r\n\r\n$$m\\_{t} = \\beta\\_{1}m\\_{t-1} + \\left(1-\\beta\\_{1}\\right)\\left(g\\_{t} + \\lambda{x\\_{t}}\\right)$$\r\n$$x\\_{t+1}^{\\left(i\\right)} = x\\_{t}^{\\left(i\\right)}  - \\eta\\_{t}\\frac{\\phi\\left(|| x\\_{t}^{\\left(i\\right)} ||\\right)}{|| m\\_{t}^{\\left(i\\right)} || }m\\_{t}^{\\left(i\\right)} $$",
  "title": "Large Batch Training of Convolutional Networks",
  "collection": "Large Batch Optimization",
  "area": "General"
}
{
  "name": "PASE+",
  "full_name": "Problem Agnostic Speech Encoder +",
  "description": "**PASE+** is a problem-agnostic speech encoder that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require manual annotations as ground truth). An online speech distortion module is employed, that contaminates the input signals with a variety of random disturbances. A revised encoder is also proposed that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks. Finally, the authors refine the set of workers used in self-supervision to encourage better cooperation.",
  "title": "Multi-task self-supervised learning for Robust Speech Recognition",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "fastText",
  "full_name": "fastText",
  "description": "**fastText** embeddings exploit subword information to construct word embeddings. Representations are learnt of character $n$-grams, and words represented as the sum of the $n$-gram vectors. This extends the word2vec type models with subword information. This helps the embeddings understand suffixes and prefixes. Once a word is represented using character $n$-grams, a skipgram model is trained to learn the embeddings.",
  "title": "Enriching Word Vectors with Subword Information",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "Explanation vs Attention",
  "full_name": "Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA",
  "description": "In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations as obtained through class activation mappings (specifically Grad-[CAM](https://paperswithcode.com/method/cam)) that are meant to explain the performance of various networks could form a means of supervision. However, as the distributions of attention maps and that of Grad-CAMs differ, it would not be suitable to directly use these as a form of supervision. Rather, we propose the use of a discriminator that aims to distinguish samples of visual explanation and attention maps. The use of adversarial training of the attention regions as a two-player game between attention and explanation serves to bring the distributions of attention maps and visual explanations closer. Significantly, we observe that providing such a means of supervision also results in attention maps that are more closely related to human attention resulting in a substantial improvement over baseline stacked attention network (SAN) models. It also results in a good improvement in rank correlation metric on the VQA task. This method can also be combined with recent MCB based methods and results in consistent improvement. We also provide comparisons with other means for learning distributions such as based on Correlation Alignment (Coral), Maximum Mean Discrepancy (MMD) and Mean Square Error (MSE) losses and observe that the adversarial loss outperforms the other forms of learning the attention maps. Visualization of the results also confirms our hypothesis that attention maps improve using this form of supervision.",
  "title": null,
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "Instance Normalization",
  "full_name": "Instance Normalization",
  "description": "**Instance Normalization** (also known as contrast normalization) is a normalization layer where:\r\n\r\n$$\r\n    y_{tijk} =  \\frac{x_{tijk} - \\mu_{ti}}{\\sqrt{\\sigma_{ti}^2 + \\epsilon}},\r\n    \\quad\r\n    \\mu_{ti} = \\frac{1}{HW}\\sum_{l=1}^W \\sum_{m=1}^H x_{tilm},\r\n    \\quad\r\n    \\sigma_{ti}^2 = \\frac{1}{HW}\\sum_{l=1}^W \\sum_{m=1}^H (x_{tilm} - \\mu_{ti})^2.\r\n$$\r\n\r\nThis prevents instance-specific mean and covariance shift simplifying the learning process. Intuitively, the normalization process allows to remove instance-specific contrast information from the content image in a task like image stylization, which simplifies generation.",
  "title": "Instance Normalization: The Missing Ingredient for Fast Stylization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Depthwise Convolution",
  "full_name": "Depthwise Convolution",
  "description": "**Depthwise Convolution** is a type of convolution where we apply a single convolutional filter for each input channel. In the regular 2D [convolution](https://paperswithcode.com/method/convolution) performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element in the output. In contrast, depthwise convolutions keep each channel separate. To summarize the steps, we:\r\n\r\n1. Split the input and filter into channels.\r\n2. We convolve each input with the respective filter.\r\n3. We stack the convolved outputs together.\r\n\r\nImage Credit: [Chi-Feng Wang](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)",
  "title": null,
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Faster R-CNN",
  "full_name": "Faster R-CNN",
  "description": "**Faster R-CNN** is an object detection model that improves on [Fast R-CNN](https://paperswithcode.com/method/fast-r-cnn) by utilising a region proposal network ([RPN](https://paperswithcode.com/method/rpn)) with the CNN model. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. It is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by [Fast R-CNN](https://paperswithcode.com/method/fast-r-cnn) for detection. RPN and Fast [R-CNN](https://paperswithcode.com/method/r-cnn) are merged into a single network by sharing their convolutional features: the RPN component tells the unified network where to look.\r\n\r\nAs a whole, Faster R-CNN consists of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.",
  "title": "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "BlendMask",
  "full_name": "BlendMask",
  "description": "**BlendMask** is an [instance segmentation framework](https://paperswithcode.com/methods/category/instance-segmentation-models) built on top of the[ FCOS](https://paperswithcode.com/method/fcos) object detector. The bottom module uses either backbone or [FPN](https://paperswithcode.com/method/fpn) features to predict a set of bases. A single [convolution](https://paperswithcode.com/methods/category/convolutions) layer is added on top of the detection towers to produce attention masks along with each bounding box prediction. For each predicted instance, the [blender](https://paperswithcode.com/method/blender) crops the bases with its bounding box and linearly combine them according the learned attention maps. Note that the Bottom Module can take features either from ‘C’, or ‘P’ as the input.",
  "title": "BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Replacing Eligibility Trace",
  "full_name": "Replacing Eligibility Trace",
  "description": "In a **Replacing Eligibility Trace**, each time the state is revisited, the trace is reset to $1$ regardless of the presence of a prior trace.. For the memory vector $\\textbf{e}\\_{t} \\in \\mathbb{R}^{b} \\geq \\textbf{0}$:\r\n\r\n$$\\mathbf{e\\_{0}} = \\textbf{0}$$\r\n\r\n$$\\textbf{e}\\_{t} = \\gamma\\lambda{e}\\_{t-1}\\left(s\\right) \\text{ if } s \\neq s\\_{t}$$\r\n\r\n$$\\textbf{e}\\_{t} = 1 \\text{ if } s = s\\_{t}$$\r\n\r\nThey can be seen as crude approximation to dutch traces, which have largely superseded them as they perform better than replacing traces and have a clearer theoretical basis. Accumulating traces remain of interest for nonlinear function approximations where dutch traces are not available.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition",
  "title": null,
  "collection": "Eligibility Traces",
  "area": "Reinforcement Learning"
}
{
  "name": "mT5",
  "full_name": "mT5",
  "description": "**mt5** is a multilingual variant of [T5](https://paperswithcode.com/method/t5) that was pre-trained on a new Common Crawl-based dataset covering $101$ languages.",
  "title": "mT5: A massively multilingual pre-trained text-to-text transformer",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "k-Sparse Autoencoder",
  "full_name": "k-Sparse Autoencoder",
  "description": "**k-Sparse Autoencoders** are autoencoders with linear activation function, where in hidden layers only the $k$ highest activities are kept. This achieves exact sparsity in the hidden representation. Backpropagation only goes through the the top $k$ activated units. This can be achieved with a [ReLU](https://paperswithcode.com/method/relu) layer with an adjustable threshold.",
  "title": "k-Sparse Autoencoders",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Inception-v4",
  "full_name": "Inception-v4",
  "description": "**Inception-v4** is a convolutional neural network architecture that builds on previous iterations of the Inception family by simplifying the architecture and using more inception modules than [Inception-v3](https://paperswithcode.com/method/inception-v3).",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Elastic ResNeXt Block",
  "full_name": "Elastic ResNeXt Block",
  "description": "An **Elastic ResNeXt Block** is a modification of the [ResNeXt Block](https://paperswithcode.com/method/resnext-block) that adds downsamplings and upsamplings in parallel branches at each layer. It is called \"elastic\" because each layer in the network is flexible in terms of choosing the best scale by a soft policy.",
  "title": "ELASTIC: Improving CNNs with Dynamic Scaling Policies",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "DALL·E 2",
  "full_name": "DALL·E 2",
  "description": "**DALL·E 2** is a generative text-to-image model made up of two main components: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding.",
  "title": "Hierarchical Text-Conditional Image Generation with CLIP Latents",
  "collection": "Image Generation Models",
  "area": "Computer Vision"
}
{
  "name": "A2C",
  "full_name": "A2C",
  "description": "**A2C**, or **Advantage Actor Critic**, is a synchronous version of the [A3C](https://paperswithcode.com/method/a3c) policy gradient method. As an alternative to the asynchronous implementation of A3C, A2C is a synchronous, deterministic implementation that waits for each actor to finish its segment of experience before updating, averaging over all of the actors. This more effectively uses GPUs due to larger batch sizes.\r\n\r\nImage Credit: [OpenAI Baselines](https://openai.com/blog/baselines-acktr-a2c/)",
  "title": "Asynchronous Methods for Deep Reinforcement Learning",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "CoVe",
  "full_name": "Contextual Word Vectors",
  "description": "**CoVe**, or **Contextualized Word Vectors**, uses a deep [LSTM](https://paperswithcode.com/method/lstm) encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors. $\\text{CoVe}$ word embeddings are therefore a function of the entire input sequence. These word embeddings can then be used in downstream tasks by concatenating them with $\\text{GloVe}$ embeddings:\r\n\r\n$$ v = \\left[\\text{GloVe}\\left(x\\right), \\text{CoVe}\\left(x\\right)\\right]$$\r\n\r\nand then feeding these in as features for the task-specific models.",
  "title": "Learned in Translation: Contextualized Word Vectors",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "Scaled Dot-Product Attention",
  "full_name": "Scaled Dot-Product Attention",
  "description": "**Scaled dot-product attention** is an attention mechanism where the dot products are scaled down by $\\sqrt{d_k}$. Formally we have a query $Q$, a key $K$ and a value $V$ and calculate the attention as:\r\n\r\n$$ {\\text{Attention}}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^{T}}{\\sqrt{d_k}}\\right)V $$\r\n\r\nIf we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \\cdot k = \\sum_{i=1}^{d_k} u_iv_i$, has mean $0$ and variance $d_k$.  Since we would prefer these values to have variance $1$, we divide by $\\sqrt{d_k}$.",
  "title": "Attention Is All You Need",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Multi-DConv-Head Attention",
  "full_name": "Multi-DConv-Head Attention",
  "description": "**Multi-DConv-Head Attention**, or **MDHA**, is a type of [Multi-Head Attention](https://paperswithcode.com/method/multi-head-attention) that utilizes [depthwise convolutions](https://paperswithcode.com/method/depthwise-convolution) after the multi-head projections. It is used in the [Primer](https://paperswithcode.com/method/primer) [Transformer](https://paperswithcode.com/method/transformer) architecture.\r\n\r\nSpecifically, 3x1 depthwise convolutions are added after each of the multi-head projections for query $Q$, key $K$ and value $V$ in self-attention. These depthwise convolutions are performed over the spatial dimension of each dense projection’s output. Interestingly, this ordering of pointwise followed by depthwise convolution is the reverse of typical [separable convolution](https://paperswithcode.com/method/depthwise-separable-convolution), which the authors find to be less effective. They also find that wider depthwise convolution and [standard convolution](https://paperswithcode.com/method/convolution) not only do not improve performance, but in several cases hurt it. \r\n\r\nMDHA is similar to [Convolutional Attention](https://paperswithcode.com/method/cvt), which uses [separable convolution](https://paperswithcode.com/method/depthwise-separable-convolution) instead of depthwise convolution and does not apply convolution operations per attention head as in MDHA.",
  "title": "Primer: Searching for Efficient Transformers for Language Modeling",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "Tunable Network",
  "full_name": "Tunable Network",
  "description": "",
  "title": "Deep Network Interpolation for Continuous Imagery Effect Transition",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "GridMask",
  "full_name": "GridMask",
  "description": "**GridMask** is a data augmentation method that randomly removes some pixels of an input image. Unlike other methods, the region that the algorithm removes is neither a continuous region nor random pixels in dropout. Instead, the algorithm removes a region with disconnected pixel sets, as shown in the Figure.\r\n\r\nWe express the setting as\r\n\r\n$$\r\n\\tilde{\\mathbf{x}}=\\mathbf{x} \\times M\r\n$$\r\n\r\nwhere $\\mathbf{x} \\in R^{H \\times W \\times C}$ represents the input image, $M \\in$ $\\{0,1\\}^{H \\times W}$ is the binary mask that stores pixels to be removed, and $\\tilde{\\mathbf{x}} \\in R^{H \\times W \\times C}$ is the result produced by the algorithm. For the binary mask $M$, if $M_{i, j}=1$ we keep pixel $(i, j)$ in the input image; otherwise we remove it. GridMask is applied after the image normalization operation.\r\n\r\nThe shape of $M$ looks like a grid, as shown in the Figure . Four numbers $\\left(r, d, \\delta_{x}, \\delta_{y}\\right)$ are used to represent a unique $M$. Every mask is formed by tiling the units. $r$ is the ratio of the shorter gray edge in a unit. $d$ is the length of one unit. $\\delta\\_{x}$ and $\\delta\\_{y}$ are the distances between the first intact unit and boundary of the image.",
  "title": "GridMask Data Augmentation",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "CGMM",
  "full_name": "Contextual Graph Markov Model",
  "description": "Contextual Graph Markov Model (CGMM) is an approach combining ideas from generative models and neural networks for the processing of graph data. It founds on a constructive methodology to build a deep architecture comprising layers of probabilistic models that learn to encode the structured information in an incremental fashion. Context is diffused in an efficient and scalable way across the graph vertexes and edges. The resulting graph encoding is used in combination with discriminative models to address structure classification benchmarks.\r\n\r\nDescription and image from: [Contextual Graph Markov Model: A Deep and Generative Approach to Graph Processing](https://arxiv.org/pdf/1805.10636.pdf)",
  "title": "Contextual Graph Markov Model: A Deep and Generative Approach to Graph Processing",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "EESP",
  "full_name": "Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions",
  "description": "An **EESP Unit**, or  Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions, is an image model block designed for edge devices. It was proposed as part of the [ESPNetv2](https://paperswithcode.com/method/espnetv2) CNN architecture. \r\n\r\nThis building block is based on a reduce-split-transform-merge strategy. The EESP unit first projects the high-dimensional input feature maps into low-dimensional space using groupwise pointwise convolutions and then learns the representations in parallel using depthwise dilated separable convolutions with different dilation rates. Different dilation rates in each branch allow the EESP unit to learn the representations from a large effective receptive field. To remove the gridding artifacts caused by dilated convolutions, the EESP fuses the feature maps using [hierarchical feature fusion](https://paperswithcode.com/method/hierarchical-feature-fusion) (HFF).",
  "title": "ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Contrastive Predictive Coding",
  "full_name": "Contrastive Predictive Coding",
  "description": "**Contrastive Predictive Coding (CPC)** learns self-supervised representations by predicting the future in latent space by using powerful autoregressive models. The model uses a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful\r\nto predict future samples.\r\n\r\nFirst, a non-linear encoder $g\\_{enc}$ maps the input sequence of observations $x\\_{t}$ to a sequence of latent representations $z\\_{t} = g\\_{enc}\\left(x\\_{t}\\right)$, potentially with a lower temporal resolution. Next, an autoregressive model $g\\_{ar}$ summarizes all $z\\leq{t}$ in the latent space and produces a context latent representation $c\\_{t} = g\\_{ar}\\left(z\\leq{t}\\right)$.\r\n\r\nA density ratio is modelled which preserves the mutual information between $x\\_{t+k}$ and $c\\_{t}$ as follows:\r\n\r\n$$ f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right) \\propto \\frac{p\\left(x\\_{t+k}|c\\_{t}\\right)}{p\\left(x\\_{t+k}\\right)} $$\r\n\r\nwhere $\\propto$ stands for ’proportional to’ (i.e. up to a multiplicative constant). Note that the density ratio $f$ can be unnormalized (does not have to integrate to 1). The authors use a simple log-bilinear model:\r\n\r\n$$ f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right) = \\exp\\left(z^{T}\\_{t+k}W\\_{k}c\\_{t}\\right) $$\r\n\r\nAny type of autoencoder and autoregressive can be used. An example the authors opt for is strided convolutional layers with residual blocks and GRUs.\r\n\r\nThe autoencoder and autoregressive models are trained to minimize an [InfoNCE](https://paperswithcode.com/method/infonce) loss (see components).",
  "title": "Representation Learning with Contrastive Predictive Coding",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Adversarial Solarization",
  "full_name": "Adversarial Solarization",
  "description": "",
  "title": "Don't Look into the Sun: Adversarial Solarization Attacks on Image Classifiers",
  "collection": "Adversarial Attacks",
  "area": "General"
}
{
  "name": "Unitary RNN",
  "full_name": "Unitary RNN",
  "description": "A **Unitary RNN** is a recurrent neural network architecture that uses a unitary hidden to hidden matrix. Specifically they concern dynamics of the form:\r\n\r\n$$ h\\_{t} = f\\left(Wh\\_{t−1} + Vx\\_{t}\\right) $$\r\n\r\nwhere $W$ is a unitary matrix $\\left(W^{†}W = I\\right)$. The product of unitary matrices is a unitary matrix, so $W$ can be parameterised as a product of simpler unitary matrices:\r\n\r\n$$ h\\_{t} = f\\left(D\\_{3}R\\_{2}F^{−1}D\\_{2}PR\\_{1}FD\\_{1}h\\_{t−1} + Vxt\\right) $$\r\n\r\nwhere $D\\_{3}$, $D\\_{2}$, $D\\_{1}$ are learned diagonal complex matrices, and $R\\_{2}$, $R\\_{1}$ are learned reflection matrices. Matrices $F$ and $F^{−1}$ are the discrete Fourier transformation and its inverse. P is any constant random permutation. The activation function $f\\left(h\\right)$ applies a rectified linear unit with a learned bias to the modulus of each complex number. Only\r\nthe diagonal and reflection matrices, $D$ and $R$, are learned, so Unitary RNNs have fewer parameters than [LSTMs](https://paperswithcode.com/method/lstm) with comparable numbers of hidden units.\r\n\r\nSource: [Associative LSTMs](https://arxiv.org/pdf/1602.03032.pdf)",
  "title": "Unitary Evolution Recurrent Neural Networks",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Dice Loss",
  "full_name": "Dice Loss",
  "description": "\\begin{equation}\r\nDiceLoss\\left( y, \\overline{p} \\right) = 1 - \\dfrac{\\left(  2y\\overline{p} + 1 \\right)} {\\left( y+\\overline{p } + 1 \\right)}\r\n\\end{equation}",
  "title": "Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Safety-llamas",
  "full_name": "Safety-llamas",
  "description": "",
  "title": "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions",
  "collection": "Generative Training",
  "area": "Computer Vision"
}
{
  "name": "mT0",
  "full_name": "mT0",
  "description": "**mT0** is a Multitask prompted finetuning (MTF) variant of mT5.",
  "title": "Crosslingual Generalization through Multitask Finetuning",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Blender",
  "full_name": "Blender",
  "description": "**Blender** is a proposal-based instance mask generation module which incorporates rich instance-level information with accurate dense pixel features. A single [convolution](https://paperswithcode.com/method/convolution) layer is added on top of the detection towers to produce attention masks along with each bounding box prediction. For each predicted instance, the blender crops predicted bases with its bounding box and linearly combines them according the learned attention maps.\r\n\r\nThe inputs of the blender module are bottom-level bases $\\mathbf{B}$, the selected top-level attentions $A$ and bounding box proposals $P$. First [RoIPool](https://paperswithcode.com/method/roi-pooling) of Mask R-CNN to crop bases with each proposal $\\mathbf{p}\\_{d}$ and then resize the region to a fixed size $R \\times R$ feature map $\\mathbf{r}\\_{d}$\r\n\r\n$$\r\n\\mathbf{r}\\_{d}=\\operatorname{RoIPool}_{R \\times R}\\left(\\mathbf{B}, \\mathbf{p}\\_{d}\\right), \\quad \\forall d \\in\\{1 \\ldots D\\}\r\n$$\r\n\r\nMore specifically,  asampling ratio 1 is used for [RoIAlign](https://paperswithcode.com/method/roi-align), i.e. one bin for each sampling point. During training, ground truth boxes are used as the proposals. During inference, [FCOS](https://paperswithcode.com/method/fcos) prediction results are used.\r\n\r\nThe attention size $M$ is smaller than $R$. 
We interpolate $\\mathbf{a}\\_{d}$ from $M \\times M$ to $R \\times R$, into the shapes of $R=\\left\\(\\mathbf{r}\\_{d} \\mid d=1 \\ldots D\\right)$\r\n\r\n$$\r\n\\mathbf{a}\\_{d}^{\\prime}=\\text { interpolate }\\_{M \\times M \\rightarrow R \\times R}\\left(\\mathbf{a}\\_{d}\\right), \\quad \\forall d \\in\\{1 \\ldots D\\}\r\n$$\r\n\r\nThen $\\mathbf{a}\\_{d}^{\\prime}$ is normalized with a softmax function along the $K$ dimension to make it a set of score maps $\\mathbf{s}\\_{d}$.\r\n\r\n$$\r\n\\mathbf{s}\\_{d}=\\operatorname{softmax}\\left(\\mathbf{a}\\_{d}^{\\prime}\\right), \\quad \\forall d \\in\\{1 \\ldots D\\}\r\n$$\r\n\r\nThen we apply element-wise product between each entity $\\mathbf{r}\\_{d}, \\mathbf{s}\\_{d}$ of the regions $R$ and scores $S$, and sum along the $K$ dimension to get our mask logit $\\mathbf{m}\\_{d}:$\r\n\r\n$$\r\n\\mathbf{m}\\_{d}=\\sum\\_{k=1}^{K} \\mathbf{s}\\_{d}^{k} \\circ \\mathbf{r}\\_{d}^{k}, \\quad \\forall d \\in\\{1 \\ldots D\\}\r\n$$\r\n\r\nwhere $k$ is the index of the basis. The mask blending process with $K=4$ is visualized in the Figure.",
  "title": "BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "Factorized Dense Synthesized Attention",
  "full_name": "Factorized Dense Synthesized Attention",
  "description": "**Factorized Dense Synthesized Attention** is a synthesized attention mechanism, similar to [dense synthesized attention](https://paperswithcode.com/method/dense-synthesized-attention), but we factorize the outputs to reduce parameters and prevent overfitting. It was proposed as part of the [Synthesizer](https://paperswithcode.com/method/synthesizer) architecture. The factorized variant of the dense synthesizer can be expressed as follows:\r\n\r\n$$A, B = F\\_{A}\\left(X\\_{i}\\right), F\\_{B}\\left(X\\_{i}\\right)$$\r\n\r\nwhere $F\\_{A}\\left(.\\right)$ projects input $X\\_{i}$ into $a$ dimensions, $F\\_B\\left(.\\right)$ projects $X\\_{i}$ to $b$ dimensions, and $a \\text{ x } b = l$. The output of the factorized module is now written as:\r\n\r\n$$ Y = \\text{Softmax}\\left(C\\right)G\\left(X\\right) $$\r\n\r\nwhere $C = H\\_{A}\\left(A\\right) * H\\_{B}\\left(B\\right)$, where $H\\_{A}$, $H\\_{B}$ are tiling functions and $C \\in \\mathbb{R}^{l \\text{ x } l}$. The tiling function simply duplicates the vector $k$ times, i.e., $\\mathbb{R}^{l} \\rightarrow \\mathbb{R}^{lk}$. In this case, $H\\_{A}\\left(\\right)$ is a projection of $\\mathbb{R}^{a} \\rightarrow \\mathbb{R}^{ab}$ and $H\\_{B}\\left(\\right)$ is a projection of $\\mathbb{R}^{b} \\rightarrow \\mathbb{R}^{ba}$. To avoid having similar values within the same block, we compose the outputs of $H\\_{A}$ and $H\\_{B}$.",
  "title": "Synthesizer: Rethinking Self-Attention in Transformer Models",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Conditional Instance Normalization",
  "full_name": "Conditional Instance Normalization",
  "description": "**Conditional Instance Normalization** is a normalization technique where all convolutional weights of a style transfer network are shared across many styles.  The goal of the procedure is transform\r\na layer’s activations $x$ into a normalized activation $z$ specific to painting style $s$. Building off\r\n[instance normalization](https://paperswithcode.com/method/instance-normalization), we augment the $\\gamma$ and $\\beta$ parameters so that they’re $N \\times C$ matrices, where $N$ is the number of styles being modeled and $C$ is the number of output feature maps. Conditioning on a style is achieved as follows:\r\n\r\n$$ z = \\gamma\\_{s}\\left(\\frac{x - \\mu}{\\sigma}\\right) + \\beta\\_{s}$$\r\n\r\nwhere $\\mu$ and $\\sigma$ are $x$’s mean and standard deviation taken across spatial axes and $\\gamma\\_{s}$ and $\\beta\\_{s}$ are obtained by selecting the row corresponding to $s$ in the $\\gamma$ and $\\beta$ matrices. One added benefit of this approach is that one can stylize a single image into $N$ painting styles with a single feed forward pass of the network with a batch size of $N$.",
  "title": "A Learned Representation For Artistic Style",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Deactivable Skip Connection",
  "full_name": "Deactivable Skip Connection",
  "description": "A **Deactivable Skip Connection** is a type of skip connection which, instead of concatenating the encoder features\r\n(red) and decoder features (blue), as with [standard skip connections](https://paperswithcode.com/methods/category/skip-connections), it instead fuses the encoder features with part of the decoder features (light blue), to be able to deactivate this operation when needed.",
  "title": "Cross-modal Deep Face Normals with Deactivable Skip Connections",
  "collection": "Skip Connections",
  "area": "General"
}
{
  "name": "RReLU",
  "full_name": "Randomized Leaky Rectified Linear Units",
  "description": "**Randomized Leaky Rectified Linear Units**, or **RReLU**, are an activation function that randomly samples the negative slope for activation values. It was first proposed and used in the Kaggle NDSB Competition. During training, $a\\_{ji}$ is a random number sampled from a uniform distribution $U\\left(l, u\\right)$. Formally:\r\n\r\n$$ y\\_{ji} = x\\_{ji} \\text{   if } x\\_{ji} \\geq{0} $$\r\n$$ y\\_{ji} = a\\_{ji}x\\_{ji} \\text{   if } x\\_{ji} < 0 $$\r\n\r\nwhere\r\n\r\n$$\\alpha\\_{ji} \\sim U\\left(l, u\\right), l < u \\text{ and } l, u \\in \\left[0,1\\right)$$\r\n\r\nIn the test phase, we take average of all the $a\\_{ji}$ in training similar to [dropout](https://paperswithcode.com/method/dropout), and thus set $a\\_{ji}$ to $\\frac{l+u}{2}$ to get a deterministic result. As suggested by the NDSB competition winner, $a\\_{ji}$ is sampled from $U\\left(3, 8\\right)$. \r\n\r\nAt test time, we use:\r\n\r\n$$ y\\_{ji} = \\frac{x\\_{ji}}{\\frac{l+u}{2}} $$",
  "title": "Empirical Evaluation of Rectified Activations in Convolutional Network",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Non-Local Block",
  "full_name": "Non-Local Block",
  "description": "A **Non-Local Block** is an image block module used in neural networks that wraps a [non-local operation](https://paperswithcode.com/method/non-local-operation). We can define a non-local block as:\r\n\r\n$$ \\mathbb{z}\\_{i} = W\\_{z}\\mathbb{y\\_{i}} + \\mathbb{x}\\_{i} $$\r\n\r\nwhere $y\\_{i}$ is the output from the non-local operation and $+ \\mathbb{x}\\_{i}$ is a [residual connection](https://paperswithcode.com/method/residual-connection).",
  "title": "Non-local Neural Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "CenterMask",
  "full_name": "CenterMask",
  "description": "**CenterMask** is an anchor-free instance segmentation method that adds a novel [spatial attention-guided mask](https://paperswithcode.com/method/spatial-attention-guided-mask) (SAG-Mask) branch to anchor-free one stage object detector ([FCOS](https://paperswithcode.com/method/fcos)) in the same vein with [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn). Plugged into the FCOS object detector, the SAG-Mask branch predicts a segmentation mask on each detected box with the spatial attention map that helps to focus on informative pixels and suppress noise.",
  "title": "CenterMask : Real-Time Anchor-Free Instance Segmentation",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "SAGAN",
  "full_name": "Self-Attention GAN",
  "description": "The **Self-Attention Generative Adversarial Network**, or **SAGAN**, allows for attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other.",
  "title": "Self-Attention Generative Adversarial Networks",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "RoIPool",
  "full_name": "RoIPool",
  "description": "**Region of Interest Pooling**, or **RoIPool**, is an operation for extracting a small feature map (e.g., $7×7$) from each RoI in detection and segmentation based tasks. Features are extracted from each candidate box, and thereafter in models like [Fast R-CNN](https://paperswithcode.com/method/fast-r-cnn), are then classified and bounding box regression performed.\r\n\r\nThe actual scaling to, e.g., $7×7$, occurs by dividing the region proposal into equally sized sections, finding the largest value in each section, and then copying these max values to the output buffer. In essence, **RoIPool** is [max pooling](https://paperswithcode.com/method/max-pooling) on a discrete grid based on a box.\r\n\r\nImage Source: [Joyce Xu](https://towardsdatascience.com/deep-learning-for-object-detection-a-comprehensive-review-73930816d8d9)",
  "title": "Rich feature hierarchies for accurate object detection and semantic segmentation",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "GPFL",
  "full_name": "Graph Path Feature Learning",
  "description": "**Graph Path Feature Learning** is a probabilistic rule learner optimized to mine instantiated first-order logic rules from knowledge graphs. Instantiated rules contain constants extracted from KGs. Compared to abstract rules that contain no constants, instantiated rules are capable of explaining and expressing concepts in more detail. GPFL utilizes a novel two-stage rule generation mechanism that first generalizes extracted paths into templates that are acyclic abstract rules until a certain degree of template saturation is achieved, then specializes the generated templates into instantiated rules.",
  "title": "Towards Learning Instantiated Logical Rules from Knowledge Graphs",
  "collection": "Rule Learners",
  "area": "General"
}
{
  "name": "DynaBERT",
  "full_name": "DynaBERT",
  "description": "**DynaBERT** is a [BERT](https://paperswithcode.com/method/bert)-variant which can flexibly adjust the size and latency by selecting adaptive width and depth. The training process of DynaBERT includes first training a width-adaptive BERT and then allowing both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used to keep the more important attention heads and neurons shared by more sub-networks. \r\n\r\nA two-stage procedure is used to train DynaBERT. First, using knowledge distillation (dashed lines) to transfer the knowledge from a fixed teacher model to student sub-networks with adaptive width in DynaBERTW. Then, using knowledge distillation (dashed lines) to transfer the knowledge from a trained DynaBERTW to student sub-networks with adaptive width and depth in DynaBERT.",
  "title": "DynaBERT: Dynamic BERT with Adaptive Width and Depth",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Precise RoI Pooling",
  "full_name": "Precise RoI Pooling",
  "description": "**Precise RoI Pooling**, or **PrRoI Pooling**, is a region of interest feature extractor that avoids any quantization of coordinates and has a continuous gradient on bounding box coordinates. Given the feature map $\\mathcal{F}$ before RoI/PrRoI Pooling (eg from Conv4 in [ResNet](https://paperswithcode.com/method/resnet)-50), let $w_{i,j}$ be the feature at one discrete location $(i,j)$ on the feature map. Using bilinear interpolation, the discrete feature map can be considered continuous at any continuous coordinates $(x,y)$:\r\n\r\n$$\r\nf(x,y) = \\sum_{i,j}IC(x,y,i,j) \\times w_{i,j},\r\n$$\r\n\r\nwhere $IC(x,y,i,j) = max(0,1-|x-i|)\\times max(0,1-|y-j|)$ is the interpolation coefficient. Then denote a bin of a RoI as $bin=\\{(x_1,y_1),(x_2,y_2)\\}$, where $(x_1,y_1)$ and $(x_2,y_2)$ are the continuous coordinates of the top-left and bottom-right points, respectively. We perform pooling (e.g. [average pooling](https://paperswithcode.com/method/average-pooling)) given $bin$ and feature map $\\mathcal{F}$ by computing a two-order integral:",
  "title": "Acquisition of Localization Confidence for Accurate Object Detection",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "SFT",
  "full_name": "Shrink and Fine-Tune",
  "description": "**Shrink and Fine-Tune**, or **SFT**, is a type of distillation that avoids explicit distillation by copying parameters to a student student model and then fine-tuning. Specifically it extracts a student model from the maximally spaced layers of a fine-tuned teacher. Each layer $l \\in L'$ is copied fully from $L$. For example, when creating a [BART](https://paperswithcode.com/method/bart) student with 3 decoder layers from the 12 encoder layer 12 decoder layer teacher, we copy the teacher’s full $Enc^{L}$ and decoder layers 0, 6, and 11 to the student. When deciding which layers to copy, we break ties arbitrarily; copying layers 0, 5, and 11 might work just as well. When copy only 1 decoder layer, we copy layer 0. This was found this to work better than copying layer 11. The impact of initialization on performance is measured experimentally in Section 6.1. After initialization, the student model continues to fine-tune on the summarization dataset, with the objective of minimizing $\\mathcal{L}\\_{Data}$.",
  "title": "Pre-trained Summarization Distillation",
  "collection": "Knowledge Distillation",
  "area": "General"
}
{
  "name": "LSUV Initialization",
  "full_name": "Layer-Sequential Unit-Variance Initialization",
  "description": "**Layer-Sequential Unit-Variance Initialization** (**LSUV**) is a simple method for weight initialization for deep net learning. The initialization strategy involves the following two step:\r\n\r\n1) First, pre-initialize weights of each [convolution](https://paperswithcode.com/method/convolution) or inner-product layer with\r\northonormal matrices. \r\n\r\n2) Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.",
  "title": "All you need is a good init",
  "collection": "Initialization",
  "area": "General"
}
{
  "name": "XGPT",
  "full_name": "XGPT",
  "description": "XGPT is a method of cross-modal generative pre-training for image captioning designed to pre-train text-to-image caption generators through three novel generation tasks, including image-conditioned masked language modeling (IMLM), image-conditioned denoising autoencoding (IDA), and text-conditioned image feature generation (TIGF). The pre-trained XGPT can be fine-tuned without any task-specific architecture modifications and build strong image captioning models.",
  "title": "XGPT: Cross-modal Generative Pre-Training for Image Captioning",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Serf",
  "full_name": "Serf",
  "description": "**Serf**, or **Log-Softplus ERror activation Function**, is a type of activation function which is self-regularized and nonmonotonic in nature. It belongs to the [Swish](https://paperswithcode.com/method/swish) family of functions. Serf is defined as:\r\n\r\n$$f\\left(x\\right) = x\\text{erf}\\left(\\ln\\left(1 + e^{x}\\right)\\right)$$",
  "title": "SERF: Towards better training of deep neural networks using log-Softplus ERror activation Function",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Pyramid Pooling Module",
  "full_name": "Pyramid Pooling Module",
  "description": "A **Pyramid Pooling Module** is a module for semantic segmentation which acts as an effective global contextual prior. The motivation is that the problem of using a convolutional network like a [ResNet](https://paperswithcode.com/method/resnet) is that, while the receptive field is already larger than the input image, the empirical receptive field is much smaller than the theoretical one especially on high-level layers. This makes many networks not sufficiently incorporate the momentous global scenery prior. \r\n\r\nThe PPM is an effective global prior representation that addresses this problem. It contains information with different scales and varying among different sub-regions. Using our 4-level pyramid, the pooling kernels cover the whole, half of, and small portions of the image. They are fused as the global prior. Then we concatenate the prior with the original feature map in the final part.",
  "title": "Pyramid Scene Parsing Network",
  "collection": "Semantic Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "ScatNet",
  "full_name": "Scattering Transform",
  "description": "A wavelet **scattering transform** computes a translation invariant representation, which is stable to deformation, using a deep [convolution](https://paperswithcode.com/method/convolution) network architecture. It computes non-linear invariants with modulus and averaging pooling functions. It helps to eliminate the image variability due to translation and is stable to deformations. \r\n\r\nImage source: [Bruna and Mallat](https://arxiv.org/pdf/1203.1513v2.pdf)",
  "title": "Invariant Scattering Convolution Networks",
  "collection": "Image Representations",
  "area": "Computer Vision"
}
{
  "name": "SqueezeNet",
  "full_name": "SqueezeNet",
  "description": "**SqueezeNet** is a convolutional neural network that employs design strategies to reduce the number of parameters, notably with the use of fire modules that \"squeeze\" parameters using 1x1 convolutions.",
  "title": "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "GHM-R",
  "full_name": "Gradient Harmonizing Mechanism R",
  "description": "**GHM-R** is a loss function designed to balance the gradient flow for bounding box refinement. The GHM first performs statistics on the number of examples with similar attributes w.r.t their gradient density and then attaches a harmonizing parameter to the gradient of each example according to the density. The modification of gradient can be equivalently implemented by reformulating the loss function. Embedding the GHM into the bounding box regression branch is denoted as GHM-R loss.",
  "title": "Gradient Harmonized Single-stage Detector",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "APPNP",
  "full_name": "Approximation of Personalized Propagation of Neural Predictions",
  "description": "Neural message-passing algorithms for semi-supervised classification on graphs have recently achieved great success. However, for classifying a node these methods only consider nodes that are a few propagation steps away and the size of this utilized neighbourhood is hard to extend. This paper uses the relationship between graph convolutional networks (GCN) and PageRank to derive an improved propagation scheme based on personalized PageRank. We utilize this propagation procedure to construct a simple model, personalized propagation of neural predictions (PPNP), and its fast approximation, APPNP. Our model's training time is on par or faster and its number of parameters is on par or lower than previous models. It leverages a large, adjustable neighbourhood for classification and can be easily combined with any neural network. We show that this model outperforms several recently proposed methods for semi-supervised classification in the most thorough study done so far for GCN-like models.",
  "title": "Predict then Propagate: Graph Neural Networks meet Personalized PageRank",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "SGD",
  "full_name": "Stochastic Gradient Descent",
  "description": "**Stochastic Gradient Descent** is an iterative optimization technique that uses minibatches of data to form an expectation of the gradient, rather than the full gradient using all available data. That is for weights $w$ and a loss function $L$ we have:\r\n\r\n$$ w\\_{t+1} = w\\_{t} - \\eta\\hat{\\nabla}\\_{w}{L(w\\_{t})} $$\r\n\r\nWhere $\\eta$ is a learning rate. SGD reduces redundancy compared to batch gradient descent - which recomputes gradients for similar examples before each parameter update - so it is usually much faster.\r\n\r\n(Image Source: [here](http://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization/))",
  "title": null,
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Global-Local Attention",
  "full_name": "Global-Local Attention",
  "description": "**Global-Local Attention** is a type of attention mechanism used in the [ETC](https://paperswithcode.com/method/etc) architecture. ETC receives two separate input sequences: the global input $x^{g} = (x^{g}\\_{1}, \\dots, x^{g}\\_{n\\_{g}})$ and the long input $x^{l} = (x^{l}\\_{1}, \\dots x^{l}\\_{n\\_{l}})$. Typically, the long input contains the input a [standard Transformer](https://paperswithcode.com/method/transformer) would receive, while the global input contains a much smaller number of auxiliary tokens ($n\\_{g}  \\ll n\\_{l}$). Attention is then split into four separate pieces: global-to-global (g2g), global-tolong (g2l), long-to-global (l2g), and long-to-long (l2l). Attention in the l2l piece (the most computationally expensive piece) is restricted to a fixed radius $r \\ll n\\_{l}$. To compensate for this limited attention span, the tokens in the global input have unrestricted attention, and thus long input tokens can transfer information to each other through global input tokens. Accordingly, g2g, g2l, and l2g pieces of attention are unrestricted.",
  "title": "ETC: Encoding Long and Structured Inputs in Transformers",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ASU",
  "full_name": "Amplifying Sine Unit: An Oscillatory Activation Function for Deep Neural Networks to Recover Nonlinear Oscillations Efficiently",
  "description": "2023",
  "title": "ArXiving Before Submission Helps Everyone",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Epsilon Greedy Exploration",
  "full_name": "Epsilon Greedy Exploration",
  "description": "**$\\epsilon$-Greedy Exploration** is an exploration strategy in reinforcement learning that takes an exploratory action with probability $\\epsilon$ and a greedy action with probability $1-\\epsilon$. It tackles the exploration-exploitation tradeoff with reinforcement learning algorithms: the desire to explore the state space with the desire to seek an optimal policy. Despite its simplicity, it is still commonly used as an behaviour policy $\\pi$ in several state-of-the-art reinforcement learning models.\r\n\r\nImage Credit: [Robin van Embden](https://cran.r-project.org/web/packages/contextual/vignettes/sutton_barto.html)",
  "title": null,
  "collection": "Behaviour Policies",
  "area": "Reinforcement Learning"
}
{
  "name": "Cross-Scale Non-Local Attention",
  "full_name": "Cross-Scale Non-Local Attention",
  "description": "**Cross-Scale Non-Local Attention**, or **CS-NL**,  is a non-local attention module for image super-resolution deep networks. It learns to mine long-range dependencies between LR features to larger-scale HR patches within the same feature map. Specifically, suppose we are conducting an s-scale super-resolution with the module, given a feature map $X$ of spatial size $(W, H)$, we first bilinearly downsample it to $Y$ with scale $s$, and match the $p\\times p$ patches in $X$ with the downsampled $p \\times p$ candidates in $Y$ to obtain the [softmax](https://paperswithcode.com/method/softmax) matching score. Finally, we conduct deconvolution.on the score by weighted adding the patches of size $\\left(sp, sp\\right)$ extracted from $X$. The obtained $Z$ of size $(sW, sH)$ will be $s$ times super-resolved than $X$.",
  "title": "Image Super-Resolution with Cross-Scale Non-Local Attention and Exhaustive Self-Exemplars Mining",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "AVSlowFast",
  "full_name": "Audiovisual SlowFast Network",
  "description": "**Audiovisual SlowFast Network**, or **AVSlowFast**, is an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are integrated with a Faster Audio pathway to model vision and sound in a unified representation. Audio and visual features are fused at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, [DropPathway](https://paperswithcode.com/method/droppathway) is used, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, hierarchical audiovisual synchronization is performed to learn joint audiovisual features.",
  "title": "Audiovisual SlowFast Networks for Video Recognition",
  "collection": "Video Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "HFPSO",
  "full_name": "Hybrid Firefly and Particle Swarm Optimization",
  "description": "**Hybrid Firefly and Particle Swarm Optimization (HFPSO)** is a metaheuristic optimization algorithm that combines strong points of firefly and particle swarm optimization. HFPSO tries to determine the start of the local search process properly by checking the previous global best fitness values.\r\n\r\n[Click Here for the Paper](https://www.sciencedirect.com/science/article/abs/pii/S156849461830084X)\r\n\r\n[Codes (MATLAB)](https://www.mathworks.com/matlabcentral/fileexchange/67768-a-hybrid-firefly-and-particle-swarm-optimization-hfpso)",
  "title": null,
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "I-BERT",
  "full_name": "I-BERT",
  "description": "**I-BERT** is a quantized version of [BERT](https://paperswithcode.com/method/bert) that quantizes the entire inference with integer-only arithmetic. Based on lightweight integer only approximation methods for nonlinear operations, e.g., [GELU](https://paperswithcode.com/method/gelu), [Softmax](https://paperswithcode.com/method/softmax), and [Layer Normalization](https://paperswithcode.com/method/layer-normalization), it performs an end-to-end integer-only [BERT](https://paperswithcode.com/method/bert) inference without any floating point calculation.\r\n\r\nIn particular, GELU and Softmax are approximated with lightweight second-order polynomials, which can be evaluated with integer-only arithmetic. For LayerNorm, integer-only computation is performed by leveraging a known algorithm for integer calculation of\r\nsquare root.",
  "title": "I-BERT: Integer-only BERT Quantization",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "DropPathway",
  "full_name": "DropPathway",
  "description": "**DropPathway** randomly drops an audio pathway during training as a regularization technique for audiovisual recognition models.  Specifically, at each training iteration, we drop the Audio pathway altogether with probability $P\\_{d}$. This way, we slow down the learning of the Audio pathway and make its learning dynamics more compatible with its visual counterpart. When dropping the audio pathway, we sum zero tensors with the visual pathways.\r\n\r\nNote that DropPathway is different from simply setting different learning rates for the audio/visual pathways in that it 1) ensures the audio pathway has fewer parameter updates, 2) hinders the visual pathway to 'shortcut' training by memorizing audio information, and 3) provides extra regularization as different audio clips are dropped in each epoch.",
  "title": "Audiovisual SlowFast Networks for Video Recognition",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Hi-LANDER",
  "full_name": "Hi-LANDER",
  "description": "**Hi-LANDER** is a hierarchical [graph neural network](https://paperswithcode.com/methods/category/graph-models) (GNN) model that learns how to cluster a set of images into an unknown number of identities using an image annotated with labels belonging to a disjoint set of identities. The hierarchical GNN uses an approach to merge connected components predicted at each level of the hierarchy to form a new graph at the next level. Unlike fully unsupervised hierarchical clustering, the choice of grouping and complexity criteria stems naturally from supervision in the training set.",
  "title": "Learning Hierarchical Graph Neural Networks for Image Clustering",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Groupwise Point Convolution",
  "full_name": "Groupwise Point Convolution",
  "description": "A **Groupwise Point Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) where we apply a [point convolution](https://paperswithcode.com/method/pointwise-convolution) groupwise (using different set of convolution filter groups).\r\n\r\nImage Credit: [Chi-Feng Wang](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)",
  "title": "ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Adaptive NMS",
  "full_name": "Adaptive NMS",
  "description": "**Adaptive Non-Maximum Suppression** is a non-maximum suppression algorithm that applies a dynamic suppression threshold to an instance according to the target density. The motivation is to find an NMS algorithm that works well for pedestrian detection in a crowd. Intuitively, a high NMS threshold keeps more crowded instances while a low NMS threshold wipes out more false positives. The adaptive-NMS thus applies a dynamic suppression strategy, where the threshold rises as instances gather and occlude each other and decays when instances appear separately. To this end, an auxiliary and learnable sub-network is designed to predict the adaptive NMS threshold for each instance.",
  "title": "Adaptive NMS: Refining Pedestrian Detection in a Crowd",
  "collection": "Proposal Filtering",
  "area": "Computer Vision"
}
{
  "name": "BiLSTM",
  "full_name": "Bidirectional LSTM",
  "description": "A **Bidirectional LSTM**, or **biLSTM**, is a sequence processing model that consists of two LSTMs: one taking the input in a forward direction, and the other in a backwards direction. BiLSTMs effectively increase the amount of information available to the network, improving the context available to the algorithm (e.g. knowing what words immediately follow *and* precede a word in a sentence).\r\n\r\nImage Source: Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks, Cornegruta et al",
  "title": null,
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "Darknet-53",
  "full_name": "Darknet-53",
  "description": "**Darknet-53** is a convolutional neural network that acts as a backbone for the [YOLOv3](https://paperswithcode.com/method/yolov3) object detection approach. The improvements upon its predecessor [Darknet-19](https://paperswithcode.com/method/darknet-19) include the use of residual connections, as well as more layers.",
  "title": "YOLOv3: An Incremental Improvement",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Sliding Window Attention",
  "full_name": "Sliding Window Attention",
  "description": "**Sliding Window Attention** is an attention pattern for attention-based models. It was proposed as part of the [Longformer](https://paperswithcode.com/method/longformer) architecture. It is motivated by the fact that non-sparse attention in the original [Transformer](https://paperswithcode.com/method/transformer) formulation has a [self-attention component](https://paperswithcode.com/method/scaled) with $O\\left(n^{2}\\right)$ time and memory complexity where $n$ is the input sequence length and thus, is not efficient to scale to long inputs. Given the importance of local context, the sliding window attention pattern employs a fixed-size window attention surrounding each token. Using multiple stacked layers of such windowed attention results in a large receptive field, where top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input. \r\n\r\nMore formally, in this attention pattern, given a fixed window size $w$, each token attends to $\\frac{1}{2}w$ tokens on each side. The computation complexity of this pattern is $O\\left(n×w\\right)$,\r\nwhich scales linearly with input sequence length $n$. To make this attention pattern efficient, $w$ should be small compared with $n$. But a model with typical multiple stacked transformers will have a large receptive field. This is analogous to CNNs where stacking layers of small kernels leads to high level features that are built from a large portion of the input (receptive field)\r\n\r\nIn this case, with a transformer of $l$ layers, the receptive field size is $l × w$ (assuming\r\n$w$ is fixed for all layers). Depending on the application, it might be helpful to use different values of $w$ for each layer to balance between efficiency and model representation capacity.",
  "title": "Longformer: The Long-Document Transformer",
  "collection": "Attention Patterns",
  "area": "Natural Language Processing"
}
{
  "name": "Denoised Smoothing",
  "full_name": "Denoised Smoothing",
  "description": "**Denoised Smoothing** is a method for obtaining a provably robust classifier from a fixed pretrained one, without any additional training or fine-tuning of the latter. The basic idea is to prepend a custom-trained denoiser before the pretrained classifier, and then apply randomized smoothing. Randomized smoothing is a certified defense that converts any given classifier $f$ into a new smoothed classifier $g$ that is characterized by a non-linear Lipschitz property. When queried at a point $x$, the smoothed classifier $g$ outputs the class that is most likely to be returned by $f$ under isotropic Gaussian perturbations of its inputs. Unfortunately, randomized smoothing requires that the underlying classifier $f$ is robust to relatively large random Gaussian perturbations of the input, which is not the case for off-the-shelf pretrained models. By applying our custom-trained denoiser to the classifier $f$, we can effectively make $f$ robust to such Gaussian perturbations, thereby making it “suitable” for randomized smoothing.",
  "title": "Denoised Smoothing: A Provable Defense for Pretrained Classifiers",
  "collection": "Robustness Methods",
  "area": "General"
}
{
  "name": "AE",
  "full_name": "Autoencoders",
  "description": "An **autoencoder** is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. \r\n\r\nExtracted from: [Wikipedia](https://en.wikipedia.org/wiki/Autoencoder)\r\n\r\nImage source: [Wikipedia](https://en.wikipedia.org/wiki/Autoencoder#/media/File:Autoencoder_schema.png)",
  "title": null,
  "collection": "Dimensionality Reduction",
  "area": "General"
}
{
  "name": "Population Based Augmentation",
  "full_name": "Population Based Augmentation",
  "description": "**Population Based Augmentation**, or **PBA**, is a data augmentation strategy (PBA), which generates nonstationary augmentation policy schedules instead of a fixed augmentation policy. In PBA we consider the augmentation policy search problem as a special case of hyperparameter schedule learning. It leverages [Population Based Training](https://paperswithcode.com/method/population-based-training) (PBT), a hyperparameter search algorithm which\r\noptimizes the parameters of a network jointly with their hyperparameters to maximize performance. The output of PBT is not an optimal hyperparameter configuration but rather a trained model and schedule of hyperparameters. \r\n\r\nIn PBA, we are only interested in the learned schedule and discard the child model result (similar to [AutoAugment](https://paperswithcode.com/method/autoaugment)). This learned augmentation schedule can then be used to improve the training of different (i.e., larger and costlier to train) models on the same dataset.\r\n\r\nPBT executes as follows. To start, a fixed population of models are randomly initialized and trained in parallel. At certain intervals, an “exploit-and-explore” procedure is applied to the worse performing population members, where the model clones the weights of a better performing model (i.e., exploitation) and then perturbs the hyperparameters of the cloned model to search in the hyperparameter space (i.e., exploration). Because the weights of the models are cloned and never reinitialized, the total computation required is the computation to train a single model times the population size.",
  "title": "Population Based Augmentation: Efficient Learning of Augmentation Policy Schedules",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "FashionCLIP",
  "full_name": "FashionCLIP",
  "description": "FashionCLIP is a fine-tuned CLIP model on fashion data (more than 800K pairs). It is the first foundation model for Fashion.",
  "title": "Contrastive language and vision learning of general fashion concepts",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "NVAE",
  "full_name": "Nouveau VAE",
  "description": "**NVAE**, or **Nouveau VAE**, is deep, hierarchical variational autoencoder. It can be trained with the original [VAE](https://paperswithcode.com/method/vae) objective, unlike alternatives such as [VQ-VAE-2](https://paperswithcode.com/method/vq-vae-2). NVAE’s design focuses on tackling two main challenges: (i) designing expressive neural\r\nnetworks specifically for VAEs, and (ii) scaling up the training to a large number of hierarchical\r\ngroups and image sizes while maintaining training stability.\r\n\r\nTo tackle long-range correlations in the data, the model employs hierarchical multi-scale modelling. The generative model starts from a small spatially arranged latent variables as $\\mathbf{z}\\_{1}$ and samples from the hierarchy group-by-group while gradually doubling the spatial dimensions. This multi-scale approach enables NVAE to capture global long-range correlations at the top of the hierarchy and local fine-grained dependencies at the lower groups.\r\n\r\nAdditional design choices include the use of residual cells for the generative models and the encoder, which employ a number of tricks and modules to achieve good performance, and the use of residual normal distributions to smooth optimization. See the components section for more details.",
  "title": "NVAE: A Deep Hierarchical Variational Autoencoder",
  "collection": "Likelihood-Based Generative Models",
  "area": "Computer Vision"
}
{
  "name": "TabNet",
  "full_name": "TabNet",
  "description": "**TabNet** is a deep tabular data learning architecture that uses sequential attention to choose which features to reason from at each decision step.\r\n\r\nThe TabNet encoder is composed of a feature transformer, an attentive transformer and feature masking. A split block\r\ndivides the processed representation to be used by the attentive transformer of the subsequent step as well as for the overall output. For each step, the feature selection mask provides interpretable information about the model’s functionality, and the masks can be aggregated to obtain global feature important attribution. The TabNet decoder is composed of a feature transformer block at each step. \r\n\r\nIn the feature transformer block, a 4-layer network is used, where 2 are shared across all decision steps and 2 are decision step-dependent. Each layer is composed of a fully-connected (FC) layer, BN and GLU nonlinearity. An attentive transformer block example – a single layer mapping is modulated with a prior scale information which aggregates how much each feature has been used before the current decision step. sparsemax is used for normalization of the coefficients, resulting in sparse selection of the salient features.",
  "title": "TabNet: Attentive Interpretable Tabular Learning",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "Vulnerability-constrained Decoding",
  "full_name": "Vulnerability-constrained Decoding",
  "description": "**Vulnerability-constrained Decoding**, is a sequence decoding approach that aims to avoid generating vulnerabilities in generated code.",
  "title": "Efficient Avoidance of Vulnerabilities in Auto-completed Smart Contract Code Using Vulnerability-constrained Decoding",
  "collection": "Sequence Decoding Methods",
  "area": "Natural Language Processing"
}
{
  "name": "Hopfield Layer",
  "full_name": "Hopfield Layer",
  "description": "A **Hopfield Layer** is a module that enables a network to associate two sets of vectors. This general functionality allows for [transformer](https://paperswithcode.com/method/transformer)-like self-attention, for decoder-encoder attention, for time series prediction (maybe with positional encoding), for sequence analysis, for multiple instance learning, for learning with point sets, for combining data sources by associations, for constructing a memory, for averaging and pooling operations, and for many more. \r\n\r\nIn particular, the Hopfield layer can readily be used as plug-in replacement for existing layers like pooling layers ([max-pooling](https://paperswithcode.com/method/max-pooling) or [average pooling](https://paperswithcode.com/method/average-pooling), permutation equivariant layers, [GRU](https://paperswithcode.com/method/gru) & [LSTM](https://paperswithcode.com/method/lstm) layers, and attention layers. The Hopfield layer is based on modern Hopfield networks with continuous states that have very high storage capacity and converge after one update.",
  "title": "Hopfield Networks is All You Need",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "HS-ResNet",
  "full_name": "HS-ResNet",
  "description": "**HS-ResNet** is a [convolutional neural network](https://paperswithcode.com/methods/category/convolutional-neural-networks) that employs [Hierarchical-Split Block](https://paperswithcode.com/method/hierarchical-split-block) as its central building block within a [ResNet](https://paperswithcode.com/method/resnet)-like architecture.",
  "title": "HS-ResNet: Hierarchical-Split Block on Convolutional Neural Network",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Graph Neural Network",
  "full_name": "Graph Neural Network",
  "description": "",
  "title": "Graph Neural Networks: A Review of Methods and Applications",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "NTK",
  "full_name": "Neural Tangent Kernel",
  "description": "",
  "title": "Neural Tangent Kernel: Convergence and Generalization in Neural Networks",
  "collection": "Kernel Methods",
  "area": "General"
}
{
  "name": "JLA",
  "full_name": "Joint Learning Architecture",
  "description": "**JLA**, or **Joint Learning Architecture**, is an approach for multiple object tracking and trajectory forecasting. It jointly trains a tracking and trajectory forecasting model, and the trajectory forecasts are used for short-term motion estimates in lieu of linear motion prediction methods such as the Kalman filter. It uses a [FairMOT](https://paperswithcode.com/method/fairmot) model as the base model because this architecture already performs detection and tracking. A forecasting branch is added to the network and is trained end-to-end. [FairMOT](https://paperswithcode.com/method/fairmot) consist of a backbone network utilizing [Deep Layer Aggregation](https://www.paperswithcode.com/method/dla), an object detection head, and a reID head.",
  "title": "Joint Learning Architecture for Multiple Object Tracking and Trajectory Forecasting",
  "collection": "Multi-Object Tracking Models",
  "area": "Computer Vision"
}
{
  "name": "Parrot",
  "full_name": "Parrot",
  "description": "**Parrot** is an imitation learning approach to automatically learn cache access patterns by leveraging Belady’s optimal policy. Belady’s optimal policy is an oracle policy that computes the theoretically optimal cache eviction decision based on knowledge of future cache accesses, which Parrot approximates with a policy that only conditions on the past accesses.",
  "title": "An Imitation Learning Approach for Cache Replacement",
  "collection": "Cache Replacement Models",
  "area": "General"
}
{
  "name": "Child-Tuning",
  "full_name": "Child-Tuning",
  "description": "**Child-Tuning** is a fine-tuning technique that updates a subset of parameters (called child network) of large pretrained models via strategically masking out the gradients of the non-child network during the backward process. It decreases the hypothesis space of the model via a task-specific mask applied to the full gradients, helping to effectively adapt the large-scale pretrained model to various tasks and meanwhile aiming to maintain its original generalization ability.",
  "title": "Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning",
  "collection": "Fine-Tuning",
  "area": "General"
}
{
  "name": "ClassSR",
  "full_name": "ClassSR",
  "description": "**ClassSR** is a framework to accelerate super-resolution (SR) networks on large images (2K-8K). ClassSR combines classification and SR in a unified framework. In particular, it first uses a Class-Module to classify the sub-images into different classes according to restoration difficulties, then applies an SR-Module to perform SR for different classes. The Class-Module is a conventional classification network, while the SR-Module is a network container that consists of the to-be-accelerated SR network and its simplified versions.",
  "title": "ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic",
  "collection": "Image Super-Resolution Models",
  "area": "Computer Vision"
}
{
  "name": "ConvBERT",
  "full_name": "ConvBERT",
  "description": "**ConvBERT** is a modification on the [BERT](https://paperswithcode.com/method/bert) architecture which uses a [span-based dynamic convolution](https://paperswithcode.com/method/span-based-dynamic-convolution) to replace self-attention heads to directly model local dependencies. Specifically a new [mixed attention module](https://paperswithcode.com/method/mixed-attention-block) replaces the [self-attention modules](https://paperswithcode.com/method/scaled) in BERT, which leverages the advantages of [convolution](https://paperswithcode.com/method/convolution) to better capture local dependency. Additionally, a new span-based dynamic convolution operation is used to utilize multiple input tokens to dynamically generate the convolution kernel. Lastly, ConvBERT also incorporates some new model designs including the bottleneck attention and grouped linear operator for the feed-forward module (reducing the number of parameters).",
  "title": "ConvBERT: Improving BERT with Span-based Dynamic Convolution",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Generalized Focal Loss",
  "full_name": "Generalized Focal Loss",
  "description": "**Generalized Focal Loss (GFL)** is a loss function for object detection that combines Quality [Focal Loss](https://paperswithcode.com/method/focal-loss) and Distribution Focal Loss into a general form.",
  "title": "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "VSGNet",
  "full_name": "Visual-Spatial-Graph Network",
  "description": "**Visual-Spatial-Graph Network** (VSGNet) is a network for human-object interaction detection. It extracts visual features from the image representing the human-object pair, refines the features with spatial configurations of the pair, and utilizes the structural connections between the pair via graph convolutions.",
  "title": "VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions",
  "collection": "Human Object Interaction Detectors",
  "area": "Computer Vision"
}
{
  "name": "DVD-GAN",
  "full_name": "DVD-GAN",
  "description": "**DVD-GAN** is a generative adversarial network for video generation built upon the [BigGAN](https://paperswithcode.com/method/biggan) architecture.\r\n\r\nDVD-GAN uses two discriminators: a Spatial Discriminator $\\mathcal{D}\\_{S}$ and a\r\nTemporal Discriminator $\\mathcal{D}\\_{T}$. $\\mathcal{D}\\_{S}$ critiques single frame content and structure by randomly sampling $k$ full-resolution frames and judging them individually.  The temporal discriminator $\\mathcal{D}\\_{T}$ must provide $G$ with the learning signal to generate movement (not evaluated by $\\mathcal{D}\\_{S}$).\r\n\r\nThe input to $G$ consists of a Gaussian latent noise $z \\sim N\\left(0, I\\right)$ and a learned linear embedding $e\\left(y\\right)$ of the desired class $y$. Both inputs are 120-dimensional vectors. $G$ starts by computing an affine transformation of $\\left[z; e\\left(y\\right)\\right]$ to a $\\left[4, 4, ch\\_{0}\\right]$-shaped tensor. $\\left[z; e\\left(y\\right)\\right]$ is used as the input to all class-[conditional Batch Normalization](https://paperswithcode.com/method/conditional-batch-normalization) layers\r\nthroughout $G$. This is then treated as the input (at each frame we would like to generate) to a Convolutional [GRU](https://paperswithcode.com/method/gru).\r\n\r\nThis RNN is unrolled once per frame. The output of this RNN is processed by two residual blocks. The time dimension is combined with the batch dimension here, so each frame proceeds through the blocks independently. The output of these blocks has width and height dimensions which\r\nare doubled (we skip upsampling in the first block). This is repeated a number of times, with the\r\noutput of one RNN + residual group fed as the input to the next group, until the output tensors have\r\nthe desired spatial dimensions. \r\n\r\nThe spatial discriminator $\\mathcal{D}\\_{S}$ functions almost identically to BigGAN’s discriminator. 
A score is calculated for each of the uniformly sampled $k$ frames (default $k = 8$) and the $\\mathcal{D}\\_{S}$ output is the sum over per-frame scores. The temporal discriminator $\\mathcal{D}\\_{T}$ has a similar architecture, but pre-processes the real or generated video with a $2 \\times 2$ average-pooling downsampling function $\\phi$. Furthermore, the first two residual blocks of $\\mathcal{D}\\_{T}$ are 3-D, where every [convolution](https://paperswithcode.com/method/convolution) is replaced with a 3-D convolution with a kernel size of $3 \\times 3 \\times 3$. The rest of the architecture follows BigGAN.",
  "title": "Adversarial Video Generation on Complex Datasets",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "HTCN",
  "full_name": "Hierarchical Transferability Calibration Network",
  "description": "**Hierarchical Transferability Calibration Network** (HTCN) is an adaptive object detector that hierarchically (local-region/image/instance) calibrates the transferability of feature representations for harmonizing transferability and discriminability. The proposed model consists of three components: (1) Importance Weighted Adversarial Training with input Interpolation (IWAT-I), which strengthens the global discriminability by re-weighting the interpolated image-level features; (2) Context-aware Instance-Level Alignment (CILA) module, which enhances the local discriminability by capturing the complementary effect between the instance-level feature and the global context information for the instance-level feature alignment; (3) local feature masks that calibrate the local transferability to provide semantic guidance for the following discriminative pattern alignment.",
  "title": "Harmonizing Transferability and Discriminability for Adapting Object Detectors",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "ENet Bottleneck",
  "full_name": "ENet Bottleneck",
  "description": "**ENet Bottleneck** is an image model block used in the [ENet](https://paperswithcode.com/method/enet) semantic segmentation architecture. Each block consists of three convolutional layers: a 1 × 1 projection that reduces the dimensionality, a main convolutional layer, and a 1 × 1 expansion. We place [Batch Normalization](https://paperswithcode.com/method/batch-normalization) and [PReLU](https://paperswithcode.com/method/prelu) between all convolutions. If the bottleneck is downsampling, a [max pooling](https://paperswithcode.com/method/max-pooling) layer is added to the main branch.\r\nAlso, the first 1 × 1 projection is replaced with a 2 × 2 [convolution](https://paperswithcode.com/method/convolution) with stride 2 in both dimensions. We zero pad the activations, to match the number of feature maps.",
  "title": "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Temporal ROIAlign",
  "full_name": "Temporal ROIAlign",
  "description": "**Temporal ROI Align** is an operator for extracting features from other frames' feature maps for current frame proposals by utilizing feature similarity. Considering the features of the same object instance are highly similar among frames in a video, the proposed operator implicitly extracts the most similar ROI features from support frames feature map for target frame proposals based on feature similarity.",
  "title": "Temporal RoI Align for Video Object Recognition",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Inception v2",
  "full_name": "Inception v2",
  "description": "**Inception v2** is the second generation of Inception convolutional neural network architectures which notably uses [batch normalization](https://paperswithcode.com/method/batch-normalization). Other changes include dropping [dropout](https://paperswithcode.com/method/dropout) and removing [local response normalization](https://paperswithcode.com/method/local-response-normalization), due to the benefits of batch normalization.",
  "title": "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Triplet Loss",
  "full_name": "Triplet Loss",
  "description": "The goal of **Triplet loss**, in the context of Siamese Networks, is to maximize the joint probability among all score-pairs i.e. the product of all probabilities. By using its negative logarithm, we can get the loss formulation as follows:\r\n\r\n$$\r\nL\\_{t}\\left(\\mathcal{V}\\_{p}, \\mathcal{V}\\_{n}\\right)=-\\frac{1}{M N} \\sum\\_{i}^{M} \\sum\\_{j}^{N} \\log \\operatorname{prob}\\left(v p\\_{i}, v n\\_{j}\\right)\r\n$$\r\n\r\nwhere the balance weight $1/MN$ is used to keep the loss with the same scale for different number of instance sets.",
  "title": "Triplet Loss in Siamese Network for Object Tracking",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "NeuroTactic",
  "full_name": "NeuroTactic",
  "description": "**NeuroTactic** is a model for theorem proving which leverages [graph neural networks](https://paperswithcode.com/methods/category/graph-models) to represent the theorem and premises, and applies graph contrastive learning for pre-training. Specifically, premise selection is designed as a pretext task for the graph contrastive learning approach. The learned representations are then used for the downstream task, tactic prediction",
  "title": "Graph Contrastive Pre-training for Effective Theorem Reasoning",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "JDeskew",
  "full_name": "Adaptive Radial Projection on Fourier Magnitude Spectrum",
  "description": "JDeskew is a novel skew estimation method that extracts the dominant skew angle of the given document image by applying an Adaptive Radial Projection on the 2D Discrete Fourier Magnitude spectrum.",
  "title": "Adaptive Radial Projection on Fourier Magnitude Spectrum for Document Image Skew Estimation",
  "collection": "Image Denoising Models",
  "area": "Computer Vision"
}
{
  "name": "PPMC",
  "full_name": "Path Planning and Motion Control",
  "description": "**Path Planning and Motion Control**, or **PPMC RL**, is a training algorithm that teaches path planning and motion control to robots using reinforcement learning in a simulated environment. The focus is on promoting generalization where there are environmental uncertainties such as rough environments like lunar services. The algorithm is coupled with any generic reinforcement learning algorithm to teach robots how to respond to user commands and to travel to designated locations on a single neural network. The algorithm works independently of the robot structure, demonstrating that it works on a wheeled rover in addition to the past results on a quadruped walking robot.",
  "title": "PPMC RL Training Algorithm: Rough Terrain Intelligent Robots through Reinforcement Learning",
  "collection": "Motion Control",
  "area": "Reinforcement Learning"
}
{
  "name": "SRMM",
  "full_name": "Stochastic Regularized Majorization-Minimization",
  "description": "",
  "title": "Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "GeniePath",
  "full_name": "GeniePath",
  "description": "GeniePath is a scalable approach for learning adaptive receptive fields of neural networks defined on permutation invariant graph data. In GeniePath, we propose an adaptive path layer consists of two complementary functions designed for breadth and depth exploration respectively, where the former learns the importance of different sized neighborhoods, while the latter extracts and filters signals aggregated from neighbors of different hops away.\r\n\r\nDescription and image from: [GeniePath: Graph Neural Networks with Adaptive Receptive Paths](https://arxiv.org/pdf/1802.00910.pdf)",
  "title": "GeniePath: Graph Neural Networks with Adaptive Receptive Paths",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "ARMA",
  "full_name": "ARMA GNN",
  "description": "The ARMA GNN layer implements a rational graph filter with a recursive approximation.",
  "title": "Graph Neural Networks with convolutional ARMA filters",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "RoBERTa",
  "full_name": "RoBERTa",
  "description": "**RoBERTa** is an extension of [BERT](https://paperswithcode.com/method/bert) with changes to the pretraining procedure. The modifications include: \r\n\r\n- training the model longer, with bigger batches, over more data\r\n- removing the next sentence prediction objective\r\n- training on longer sequences\r\n- dynamically changing the masking pattern applied to the training data. The authors also collect a large new dataset ($\\text{CC-News}$) of comparable size to other privately used datasets, to better control for training set size effects",
  "title": "RoBERTa: A Robustly Optimized BERT Pretraining Approach",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Corner Pooling",
  "full_name": "Corner Pooling",
  "description": "**Corner Pooling** is a pooling technique for object detection that seeks to better localize corners by encoding explicit prior knowledge. Suppose we want to determine if a pixel at location $\\left(i, j\\right)$ is a top-left corner. Let $f\\_{t}$ and $f\\_{l}$ be the feature maps that are the inputs to the top-left corner pooling layer, and let $f\\_{t\\_{ij}}$ and $f\\_{l\\_{ij}}$ be the vectors at location $\\left(i, j\\right)$ in $f\\_{t}$ and $f\\_{l}$ respectively. With $H \\times W$ feature maps, the corner pooling layer first max-pools all feature vectors between $\\left(i, j\\right)$ and $\\left(i, H\\right)$ in $f\\_{t}$ to a feature vector $t\\_{ij}$ , and max-pools all feature vectors between $\\left(i, j\\right)$ and $\\left(W, j\\right)$ in $f\\_{l}$ to a feature vector $l\\_{ij}$. Finally, it adds $t\\_{ij}$ and $l\\_{ij}$ together.",
  "title": "CornerNet: Detecting Objects as Paired Keypoints",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "VoVNet",
  "full_name": "VoVNet",
  "description": "**VoVNet** is a convolutional neural network that seeks to make [DenseNet](https://paperswithcode.com/method/densenet) more efficient by concatenating all features only once in the last feature map, which makes input size constant and enables enlarging new output channel. In the Figure to the right, $F$ represents a [convolution](https://paperswithcode.com/method/convolution) layer and $\\otimes$ indicates concatenation.",
  "title": "An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Aging Evolution",
  "full_name": "Aging Evolution",
  "description": "**Aging Evolution**, or **Regularized Evolution**, is an evolutionary algorithm for [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). Whereas in tournament selection, the best architectures are kept, in aging evolution we associate each genotype with an age, and bias the tournament selection to choose\r\nthe younger genotypes. In the context of architecture search, aging evolution allows us to explore the search space more, instead of zooming in on good models too early, as non-aging evolution would.",
  "title": "Regularized Evolution for Image Classifier Architecture Search",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "KnowPrompt",
  "full_name": "KnowPrompt",
  "description": "**KnowPrompt** is a prompt-tuning approach for relational understanding. It injects entity and relation knowledge into prompt construction with learnable virtual template words as well as answer words and synergistically optimize their representation with knowledge constraints. To be specific, TYPED MARKER is utilized around entities initialized with aggregated entity-type embeddings as learnable virtual template words to inject entity type knowledge. The average embeddings of each token are leveraged in relation labels as virtual answer words to inject relation knowledge. Since there exist implicit structural constraints among entities and relations, and virtual words should be consistent with the surrounding contexts, synergistic optimization is introduced to obtain optimized virtual templates and answer words. Concretely, a context-aware prompt calibration method is used with implicit structural constraints to inject structural knowledge implications among relational triples and associate prompt embeddings with each other.",
  "title": "KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction",
  "collection": "Prompt Engineering",
  "area": "General"
}
{
  "name": "GHM-C",
  "full_name": "Gradient Harmonizing Mechanism C",
  "description": "**GHM-C** is a loss function designed to balance the gradient flow for anchor classification. The GHM first performs statistics on the number of examples with similar attributes w.r.t their gradient density and then attaches a harmonizing parameter to the gradient of each example according to the density. The modification of gradient can be equivalently implemented by reformulating the loss function. Embedding the GHM into the classification loss is denoted as GHM-C loss. Since the gradient density is a statistical variable depending on the examples distribution in a mini-batch, GHM-C is a dynamic loss that can adapt to the change of data distribution in each batch as well as to the updating of the model.",
  "title": "Gradient Harmonized Single-stage Detector",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "MT-PET",
  "full_name": "MT-PET",
  "description": "**MT-PET** is a multi-task version of [Pattern Exploiting Training](https://arxiv.org/abs/2001.07676) (PET) for exaggeration detection, which leverages knowledge from complementary cloze-style QA tasks to improve few-shot learning. It defines pairs of complementary pattern-verbalizer pairs for a main task and auxiliary task. These PVPs are then used to train PET on data from both tasks.\r\n\r\nPET uses the masked language modeling objective of pretrained language models to transform a task into one or more cloze-style question answering tasks.  In the original PET implementation, PVPs are defined for a single target task. MT-PET extends this by allowing for auxiliary PVPs from related tasks, adding complementary cloze-style QA tasks during training. The motivation for the multi-task approach is two-fold: 1) complementary cloze-style tasks can potentially help the model to learn different aspects of the main task, i.e. the similar tasks of exaggeration detection and claim strength prediction; 2) data on related tasks can be utilized during training, which is important in situations where data for the main task is limited.",
  "title": "Semi-Supervised Exaggeration Detection of Health Science Press Releases",
  "collection": "Exaggeration Detection Models",
  "area": "Natural Language Processing"
}
{
  "name": "DANCE",
  "full_name": "Domain Adaptative Neighborhood Clustering via Entropy Optimization",
  "description": "**Domain Adaptive Neighborhood Clustering via Entropy Optimization (DANCE)** is a self-supervised clustering method that harnesses the cluster structure of the target domain using self-supervision. This is done with a neighborhood clustering technique that self-supervises feature learning in the target. At the same time, useful source features and class boundaries are preserved and adapted with a partial domain alignment loss that the authors refer to as entropy separation loss. This loss allows the model to either match each target example with the source, or reject it as unknown.",
  "title": "Universal Domain Adaptation through Self Supervision",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "Inception-ResNet-v2-B",
  "full_name": "Inception-ResNet-v2-B",
  "description": "**Inception-ResNet-v2-B** is an image model block for a 17 x 17 grid used in the [Inception-ResNet-v2](https://paperswithcode.com/method/inception-resnet-v2) architecture. It largely follows the idea of Inception modules - and grouped convolutions - but also includes residual connections.",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Target Policy Smoothing",
  "full_name": "Target Policy Smoothing",
  "description": "**Target Policy Smoothing** is a regularization strategy for the value function in reinforcement learning. Deterministic policies can overfit to narrow peaks in the value estimate, making them highly susceptible to functional approximation error, increasing the variance of the target. To reduce this variance, target policy smoothing adds a small amount of random noise to the target policy and averages over mini-batches - approximating a [SARSA](https://paperswithcode.com/method/sarsa)-like expectation/integral.\r\n\r\nThe modified target update is:\r\n\r\n$$ y = r + \\gamma{Q}\\_{\\theta'}\\left(s', \\pi\\_{\\theta'}\\left(s'\\right) + \\epsilon \\right) $$\r\n\r\n$$ \\epsilon \\sim \\text{clip}\\left(\\mathcal{N}\\left(0, \\sigma\\right), -c, c \\right) $$\r\n\r\nwhere the added noise is clipped to keep the target close to the original action. The outcome is an algorithm reminiscent of [Expected SARSA](https://paperswithcode.com/method/expected-sarsa), where the value estimate is instead learned off-policy and the noise added to the target policy is chosen independently of the exploration policy. The value estimate learned is with respect to a noisy policy defined by the parameter $\\sigma$.",
  "title": "Addressing Function Approximation Error in Actor-Critic Methods",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Concurrent Spatial and Channel Squeeze & Excitation",
  "full_name": "Concurrent Spatial and Channel Squeeze & Excitation (scSE)",
  "description": "Combines the channel attention of the widely known [spatial squeeze and channel excitation (SE)](https://paperswithcode.com/method/squeeze-and-excitation-block) block and the spatial attention of the [channel squeeze and spatial excitation (sSE)](https://paperswithcode.com/method/channel-squeeze-and-spatial-excitation#) block to build a spatial and channel attention mechanism for image segmentation tasks.",
  "title": "Recalibrating Fully Convolutional Networks with Spatial and Channel 'Squeeze & Excitation' Blocks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Sparse Sinkhorn Attention",
  "full_name": "Sparse Sinkhorn Attention",
  "description": "**Sparse Sinkhorn Attention** is an attention mechanism that reduces the memory complexity of the [dot-product attention mechanism](https://paperswithcode.com/method/scaled) and is capable of learning sparse attention outputs. It is based on the idea of differentiable sorting of internal representations within the self-attention module. SSA incorporates a meta sorting network that learns to rearrange and sort input sequences. Sinkhorn normalization is used to normalize the rows and columns of the sorting matrix. The actual SSA attention mechanism then acts on the block sorted sequences.",
  "title": "Sparse Sinkhorn Attention",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "HGS",
  "full_name": "Hunger Games Search",
  "description": "**Hunger Games Search (**HGS**)** is a general-purpose population-based optimization technique with a simple structure, special stability features and very competitive performance to realize the solutions of both constrained and unconstrained problems more effectively. HGS is designed according to the hunger-driven activities and behavioural choice of animals. This dynamic, fitness-wise search method follows a simple concept of “Hunger” as the most crucial homeostatic motivation and reason for behaviours, decisions, and actions in the life of all animals to make the process of optimization more understandable and consistent for new users and decision-makers. The Hunger Games Search incorporates the concept of hunger into the feature process; in other words, an adaptive weight based on the concept of hunger is designed and employed to simulate the effect of hunger on each search step. It follows the computationally logical rules (games) utilized by almost all animals and these rival activities and games are often adaptive evolutionary by securing higher chances of survival and food acquisition. This method's main feature is its dynamic nature, simple structure, and high performance in terms of convergence and acceptable quality of solutions, proving to be more efficient than the current optimization methods. \r\n\r\nImplementation of the HGS algorithm is available at [https://aliasgharheidari.com/HGS.html](https://aliasgharheidari.com/HGS.html).",
  "title": null,
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "OFA",
  "full_name": "OFA",
  "description": "In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows the instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on uni-modal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.",
  "title": "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "PipeMare",
  "full_name": "PipeMare",
  "description": "**PipeMare** is an asynchronous (bubble-free) pipeline parallel method for training large neural networks. It involves two main techniques: learning rate rescheduling and discrepancy correction.",
  "title": "PipeMare: Asynchronous Pipeline Parallel DNN Training",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "DROID-SLAM",
  "full_name": "DROID-SLAM",
  "description": "**DROID-SLAM** is a deep learning based SLAM system. It consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer. This layer leverages geometric constraints, improves accuracy and robustness, and enables a monocular system to handle stereo or RGB-D input without retraining. It builds a dense 3D map of the environment while simultaneously localizing the camera within the map.",
  "title": "DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras",
  "collection": "SLAM Methods",
  "area": "General"
}
{
  "name": "ZoomNet",
  "full_name": "ZoomNet",
  "description": "**ZoomNet** is a 2D human whole-body pose estimation technique. It aims to localize dense landmarks on the entire human body including face, hands, body, and feet. ZoomNet follows the top-down paradigm. Given a human bounding box of each person, ZoomNet first localizes the easy-to-detect body keypoints and estimates the rough position of hands and face. Then it zooms in to focus on the hand/face areas and predicts keypoints using features with higher resolution for accurate localization. Unlike previous approaches which usually assemble multiple networks, ZoomNet has a single network that is end-to-end trainable. It unifies five network heads including the human body pose estimator, hand and face detectors, and hand and face pose estimators into a single network with shared low-level features.",
  "title": "Whole-Body Human Pose Estimation in the Wild",
  "collection": "Pose Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "CeiT",
  "full_name": "Convolution-enhanced image Transformer",
  "description": "**Convolution-enhanced image Transformer** (**CeiT**) combines the advantages of CNNs in extracting low-level features, strengthening locality, and the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of the straightforward tokenization from raw input images, we design an **Image-to-Tokens** (**I2T**) module that extracts patches from generated low-level features; 2) the feed-froward network in each encoder block is replaced with a **Locally-enhanced Feed-Forward** (**LeFF**) layer that promotes the correlation among neighbouring tokens in the spatial dimension; 3) a **Layer-wise Class token Attention** (**LCA**) is attached at the top of the Transformer that utilizes the multi-level representations.",
  "title": "Incorporating Convolution Designs into Visual Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "CRN",
  "full_name": "Conditional Relation Network",
  "description": "**Conditional Relation Network**, or **CRN**, is a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning.",
  "title": "Hierarchical Conditional Relation Networks for Video Question Answering",
  "collection": "Video Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "DU-GAN",
  "full_name": "DU-GAN",
  "description": "**DU-GAN** is a [generative adversarial network](https://www.paperswithcode.com/methods/category/generative-adversarial-networks) for LDCT denoising in medical imaging. The generator produces denoised LDCT images, and two independent branches with [U-Net](https://paperswithcode.com/method/u-net) based discriminators perform at the image and gradient domains. The U-Net based discriminator provides both global structure and local per-pixel feedback to the generator. Furthermore, the image discriminator encourages the generator to produce photo-realistic CT images while the gradient discriminator is utilized for better edge and alleviating streak artifacts caused by photon starvation.",
  "title": "DU-GAN: Generative Adversarial Networks with Dual-Domain U-Net Based Discriminators for Low-Dose CT Denoising",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "LIMix",
  "full_name": "Lifelong Infinite Mixture",
  "description": "**LIMix**, or **Lifelong Infinite Mixture**, is a lifelong learning model which grows a mixture of models to adapt to an increasing number of tasks.  LIMix can automatically expand its network architectures or choose an appropriate component to adapt its parameters for learning a new task, while preserving its previously learnt information. Knowledge is incorporated by means of Dirichlet processes by using a gating mechanism which computes the dependence between the knowledge learnt previously and stored in each component, and a new set of data. Besides, a Student model is trained which can accumulate cross-domain representations over time and make quick inferences.",
  "title": "Lifelong Infinite Mixture Model Based on Knowledge-Driven Dirichlet Process",
  "collection": "Lifelong Learning",
  "area": "General"
}
{
  "name": "Embedded Dot Product Affinity",
  "full_name": "Embedded Dot Product Affinity",
  "description": "**Embedded Dot Product Affinity** is a type of affinity or self-similarity function between two points $\\mathbb{x\\_{i}}$ and $\\mathbb{x\\_{j}}$ that uses a dot product function in an embedding space:\r\n\r\n$$ f\\left(\\mathbb{x\\_{i}}, \\mathbb{x\\_{j}}\\right) = \\theta\\left(\\mathbb{x\\_{i}}\\right)^{T}\\phi\\left(\\mathbb{x\\_{j}}\\right) $$\r\n\r\nHere $\\theta\\left(x\\_{i}\\right) = W\\_{θ}x\\_{i}$ and $\\phi\\left(x\\_{j}\\right) = W\\_{φ}x\\_{j}$ are two embeddings.\r\n\r\nThe main difference between the dot product and [embedded Gaussian affinity](https://paperswithcode.com/method/embedded-gaussian-affinity) functions is the presence of [softmax](https://paperswithcode.com/method/softmax), which plays the role of an activation function.",
  "title": "Non-local Neural Networks",
  "collection": "Affinity Functions",
  "area": "General"
}
{
  "name": "Sample Redistribution",
  "full_name": "Sample Redistribution",
  "description": "**Sample Redistribution** is a [data augmentation](https://paperswithcode.com/methods/category/image-data-augmentation) technique for face detection which augments training samples based on the statistics of benchmark datasets via large-scale cropping. During training data augmentation, square patches are cropped from the original images with a random size from the set $[0.3,1.0]$ of the short edge of the original images. To generate more positive samples for stride 8, the random size range is enlarged from $[0.3,1.0]$ to $[0.3,2.0]$. When the crop box is beyond the original image, average RGB values fill the missing pixels.\r\n\r\nThe motivation is that for efficient [face detection](https://paperswithcode.com/task/face-detection) under a fixed VGA resolution (i.e. 640×480), most of the faces (78.93%) in [WIDER FACE](https://paperswithcode.com/dataset/wider-face-1) are smaller than 32×32 pixels, and thus they are predicted by shallow stages. To obtain more training samples for these shallow stages, Sample Redistribution (SR) is used.",
  "title": "Sample and Computation Redistribution for Efficient Face Detection",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "PolyConv",
  "full_name": "Polynomial Convolution",
  "description": "PolyConv learns continuous distributions as the convolutional filters to share the weights across different vertices of graphs or points of point clouds.",
  "title": "PolyNet: Polynomial Neural Network for 3D Shape Recognition with PolyShape Representation",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DCGAN",
  "full_name": "Deep Convolutional GAN",
  "description": "**DCGAN**, or **Deep Convolutional GAN**, is a generative adversarial network architecture. It uses a couple of guidelines, in particular:\r\n\r\n- Replacing any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).\r\n- Using batchnorm in both the generator and the discriminator.\r\n- Removing fully connected hidden layers for deeper architectures.\r\n- Using [ReLU](https://paperswithcode.com/method/relu) activation in generator for all layers except for the output, which uses tanh.\r\n- Using LeakyReLU activation in the discriminator for all layer.",
  "title": "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "BiSeNet V2",
  "full_name": "BiSeNet V2",
  "description": "**BiSeNet V2** is a two-pathway architecture for real-time semantic segmentation. One pathway is designed to capture the spatial details with wide channels and shallow layers, called Detail Branch. In contrast, the other pathway is introduced to extract the categorical semantics with narrow channels and deep layers, called Semantic Branch. The Semantic Branch simply requires a large receptive field to capture semantic context, while the detail information can be supplied by the Detail Branch. Therefore, the Semantic Branch can be made very lightweight with fewer channels and a fast-downsampling strategy. Both types of feature representation are merged to construct a stronger and more comprehensive feature representation.",
  "title": "BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "SPP-Net",
  "full_name": "SPP-Net",
  "description": "**SPP-Net** is a convolutional neural architecture that employs [spatial pyramid pooling](https://paperswithcode.com/method/spatial-pyramid-pooling) to remove the fixed-size constraint of the network. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). In other words, we perform some information aggregation at a deeper stage of the network hierarchy (between convolutional layers and fully-connected layers) to avoid the need for cropping or warping at the beginning.",
  "title": "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "WGAN",
  "full_name": "Wasserstein GAN",
  "description": "**Wasserstein GAN**, or **WGAN**, is a type of generative adversarial network that minimizes an approximation of the Earth-Mover's distance (EM) rather than the Jensen-Shannon divergence as in the original [GAN](https://paperswithcode.com/method/gan) formulation. It leads to more stable training than original GANs with less evidence of mode collapse, as well as meaningful curves that can be used for debugging and searching hyperparameters.",
  "title": "Wasserstein GAN",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "ManifoldPlus",
  "full_name": "ManifoldPlus",
  "description": "**ManifoldPlus** is a method for robust and scalable conversion of triangle soups to watertight manifolds. It extracts exterior faces between occupied voxels and empty voxels, and uses a projection based optimization method to accurately recover a watertight manifold that resembles the reference mesh. It does not rely on face normals of the input triangle soups and can accurately recover zero-volume structures. For scalability, it employs an adaptive Gauss-Seidel method for shape optimization, in which each step is an easy-to-solve convex problem.",
  "title": "ManifoldPlus: A Robust and Scalable Watertight Manifold Surface Generation Method for Triangle Soups",
  "collection": "Graphics Models",
  "area": "Computer Vision"
}
{
  "name": "Contextual Residual Aggregation",
  "full_name": "Contextual Residual Aggregation",
  "description": "**Contextual Residual Aggregation**, or **CRA**, is a module for image inpainting. It can produce high-frequency residuals for missing contents by weighted aggregating residuals from contextual patches, thus only requiring a low-resolution prediction from the network. Specifically, it involves a neural network to predict a low-resolution inpainted result and up-sample it to yield a large blurry image. Then we produce the high-frequency residuals for in-hole patches by aggregating weighted high-frequency residuals from contextual patches. Finally, we add the aggregated residuals to the large blurry image to obtain a sharp result.",
  "title": "Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Movement Pruning",
  "full_name": "Movement Pruning",
  "description": "**Movement Pruning** is a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. Magnitude pruning can be seen as utilizing zeroth-order information (absolute value) of the running model. In contrast, movement pruning methods are where importance is derived from first-order information. Intuitively, instead of selecting weights that are far from zero, we retain connections that are moving away from zero during the training process.",
  "title": "Movement Pruning: Adaptive Sparsity by Fine-Tuning",
  "collection": "Pruning",
  "area": "General"
}
{
  "name": "SSDS",
  "full_name": "Self-Supervised Deep Supervision",
  "description": "The method exploits the finding that high correlation of segmentation performance among each U-Net's decoder layer -- with discriminative layer attached -- tends to have higher segmentation performance in the final segmentation map. By introducing an \"Inter-layer Divergence Loss\", based on Kulback-Liebler Divergence, to promotes the consistency between each discriminative output from decoder layers by minimizing the divergence.\r\n\r\nIf we assume that each decoder layer is equivalent to PDE function parameterized by weight parameter $\\theta$:\r\n\r\n$Decoder_i(x;\\theta_i) \\equiv PDE(x;\\theta_i)$\r\n\r\nThen our objective is trying to make each discriminative output similar to each other:\r\n\r\n$PDE(x; \\theta_d) \\sim PDE(x; \\theta_i);\\text{ } 0 \\leq i < d$\r\n\r\nHence the objective is to $\\text{minimize} \\sum_{i=0}^{d} D_{KL}(\\hat{y} || Decoder_i)$.",
  "title": "OCTAve: 2D en face Optical Coherence Tomography Angiography Vessel Segmentation in Weakly-Supervised Learning with Locality Augmentation",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "FPG",
  "full_name": "Feature Pyramid Grid",
  "description": "**Feature Pyramid Grids**, or **FPG**, is a deep multi-pathway feature pyramid, that represents the feature scale-space as a regular grid of parallel bottom-up pathways which are fused by multi-directional lateral connections. It connects the backbone features, $C$, of a ConvNet with a regular structure of $p$ parallel top-down pyramid pathways which are fused by multi-directional lateral connections, AcrossSame, AcrossUp, AcrossDown, and AcrossSkip. AcrossSkip are direct connections while all other types use [convolutional](https://paperswithcode.com/method/convolution) and [ReLU](https://paperswithcode.com/method/relu) layers.\r\n\r\nOn a high-level, FPG is a deep generalization of [FPN](https://paperswithcode.com/method/fpn) from one to $p$ pathways under a dense lateral connectivity structure.",
  "title": "Feature Pyramid Grids",
  "collection": "Feature Pyramid Blocks",
  "area": "Computer Vision"
}
{
  "name": "ProxylessNet-GPU",
  "full_name": "ProxylessNet-GPU",
  "description": "**ProxylessNet-GPU** is a convolutional neural network architecture learnt with the [ProxylessNAS](https://paperswithcode.com/method/proxylessnas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) algorithm that is optimized for GPU devices. It uses inverted residual blocks (MBConvs) from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2) as its basic building block.",
  "title": "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "Enhanced Fusion Framework",
  "full_name": "Enhanced Fusion Framework",
  "description": "The **Enhanced Fusion Framework** proposes three different ideas to improve the existing MI-based BCI frameworks.\r\n\r\nImage source: [Fumanal-Idocin et al.](https://arxiv.org/pdf/2101.06968v1.pdf)",
  "title": "Motor-Imagery-Based Brain Computer Interface using Signal Derivation and Aggregation Functions",
  "collection": "Time Series Analysis",
  "area": "Sequential"
}
{
  "name": "PAFNet",
  "full_name": "Paddle Anchor Free Network",
  "description": "**PAFNet** is an anchor-free detector for object detection that removes pre-defined anchors and regresses the locations directly, which can achieve higher efficiency. The overall network is composed of a backbone, an up-sampling module, an AGS module, a localization branch and a regression branch. Specifically,  ResNet50-vd is chosen as the backbone for server side, and [MobileNetV3](https://paperswithcode.com/method/mobilenetv3) for mobile side. Besides, for mobile side, we replace traditional [convolution](https://paperswithcode.com/method/convolution) layers with lite convolution operators.",
  "title": "PAFNet: An Efficient Anchor-Free Object Detector Guidance",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "context2vec",
  "full_name": "context2vec",
  "description": "**context2vec** is an unsupervised model for learning generic context embedding of wide sentential contexts, using a bidirectional [LSTM](https://paperswithcode.com/method/lstm). A large plain text corpora is trained on to learn a neural model that embeds entire sentential contexts and target words in the same low-dimensional space, which\r\nis optimized to reflect inter-dependencies between targets and their entire sentential context as a whole. \r\n\r\nIn contrast to word2vec that use context modeling mostly internally and considers the target word embeddings as their main output, the focus of context2vec is the context representation. context2vec achieves its objective by assigning similar embeddings to sentential contexts and their associated target words.",
  "title": "context2vec: Learning Generic Context Embedding with Bidirectional LSTM",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "UNet++",
  "full_name": "UNet++",
  "description": "UNet++ is an architecture for semantic segmentation based on the [U-Net](https://paperswithcode.com/method/u-net). Through the use of densely connected nested decoder sub-networks, it enhances extracted feature processing and was reported by its authors to outperform the U-Net in [Electron Microscopy (EM)](https://imagej.net/events/isbi-2012-segmentation-challenge), [Cell](https://acsjournals.onlinelibrary.wiley.com/doi/full/10.1002/cncy.21576), [Nuclei](https://www.kaggle.com/c/data-science-bowl-2018), [Brain Tumor](https://paperswithcode.com/dataset/brats-2013-1), [Liver](https://paperswithcode.com/dataset/lits17) and [Lung Nodule](https://paperswithcode.com/dataset/lidc-idri) medical image segmentation tasks.",
  "title": "UNet++: A Nested U-Net Architecture for Medical Image Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "SGDW",
  "full_name": "SGDW",
  "description": "**SGDW** is a stochastic optimization technique that decouples [weight decay](https://paperswithcode.com/method/weight-decay) from the gradient update:\r\n\r\n$$ g\\_{t} =  \\nabla{f\\_{t}}\\left(\\theta\\_{t-1}\\right) + \\lambda\\theta\\_{t-1}$$\r\n\r\n$$ m\\_{t} =  \\beta\\_{1}m\\_{t-1} + \\eta\\_{t}\\alpha{g}\\_{t}$$\r\n\r\n$$ \\theta\\_{t} = \\theta\\_{t-1} - m\\_{t} - \\eta\\_{t}\\lambda\\theta\\_{t-1}$$",
  "title": "Decoupled Weight Decay Regularization",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "hdxresnet",
  "full_name": "Hybrid-deconvolution",
  "description": "A resnet-like architecture with deconvolution feature normalization (Ye et al. 2020, ICLR) layers in the first few layers for sparse low-level feature identification, and batch normalization layers in the later layers.",
  "title": "Predicting galaxy spectra from images with hybrid convolutional neural networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "CTC Loss",
  "full_name": "Connectionist Temporal Classification Loss",
  "description": "A **Connectionist Temporal Classification Loss**, or **CTC Loss**, is designed for tasks where we need alignment between sequences, but where that alignment is difficult - e.g. aligning each character to its location in an audio file. It calculates a loss between a continuous (unsegmented) time series and a target sequence. It does this by summing over the probability of possible alignments of input to target, producing a loss value which is differentiable with respect to each input node. The alignment of input to target is assumed to be “many-to-one”, which limits the length of the target sequence such that it must be $\\leq$ the input length.",
  "title": null,
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Protagonist Antagonist Induced Regret Environment Design",
  "full_name": "Protagonist Antagonist Induced Regret Environment Design",
  "description": "**Protagonist Antagonist Induced Regret Environment Design**, or **PAIRED**, is an adversarial method for approximate minimax regret to generate environments for reinforcement learning. It introduces an antagonist which is allied with the environment generating adversary. The primary agent we are trying to train is the protagonist. The environment adversary’s goal is to design environments in which the antagonist achieves high reward and the protagonist receives low reward. If the adversary generates unsolvable environments, the antagonist and protagonist would perform the same and the adversary would get a score of zero, but if the adversary finds environments the antagonist solves and the protagonist does not solve, the adversary achieves a positive score. Thus, the environment adversary is incentivized to create challenging but feasible environments, in which the antagonist can outperform the protagonist. Moreover, as the protagonist learns to solves the simple environments, the antagonist must generate more complex environments to make the protagonist fail, increasing the complexity of the generated tasks and leading to automatic curriculum generation.",
  "title": "Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "Window-based Discriminator",
  "full_name": "Window-based Discriminator",
  "description": "A **Window-based Discriminator** is a type of discriminator for generative adversarial networks. It is analogous to a [PatchGAN](https://paperswithcode.com/method/patchgan) but designed for audio. While a standard [GAN](https://paperswithcode.com/method/gan) discriminator learns to classify between distributions of entire audio sequences, window-based discriminator learns to classify between distribution of small audio chunks. Since the discriminator loss is computed over the overlapping windows where each window is very large (equal to the receptive field of the discriminator), the model learns to maintain coherence across patches.",
  "title": "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis",
  "collection": "Discriminators",
  "area": "General"
}
{
  "name": "DeepLab",
  "full_name": "DeepLab",
  "description": "**DeepLab** is a semantic segmentation architecture. First, the input image goes through the network with the use of dilated convolutions. Then the output from the network is bilinearly interpolated and goes through the fully connected [CRF](https://paperswithcode.com/method/crf) to fine tune the result we obtain the final predictions.",
  "title": "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "GGS-NNs",
  "full_name": "Gated Graph Sequence Neural Networks",
  "description": "Gated Graph Sequence Neural Networks (GGS-NNs) is a novel graph-based neural network model. GGS-NNs modifies Graph Neural Networks (Scarselli et al., 2009) to use gated recurrent units and modern optimization techniques and then extend to output sequences.\r\n\r\nSource: [Li et al.](https://arxiv.org/pdf/1511.05493v4.pdf)\r\n\r\nImage source: [Li et al.](https://arxiv.org/pdf/1511.05493v4.pdf)",
  "title": "Gated Graph Sequence Neural Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Mobile Neural Network",
  "full_name": "MNN",
  "description": "**Mobile Neural Network (MNN)** is a mobile inference engine tailored to mobile applications. The contributions of MNN include: (1) presenting a mechanism called pre-inference that manages to conduct runtime optimization; (2) delivering thorough kernel optimization on operators to achieve optimal computation performance; (3) introducing backend abstraction module which enables hybrid scheduling and keeps the engine lightweight.",
  "title": "MNN: A Universal and Efficient Inference Engine",
  "collection": "Inference Engines",
  "area": "General"
}
{
  "name": "LightGCN",
  "full_name": "LightGCN",
  "description": "**LightGCN** is a type of [graph convolutional neural network](https://paperswithcode.com/method/gcn) (GCN), including only the most essential component in GCN (neighborhood aggregation) for collaborative filtering. Specifically, LightGCN learns user and item embeddings by linearly propagating them on the user-item interaction graph, and uses the weighted sum of the embeddings learned at all layers as the final embedding.",
  "title": "LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation",
  "collection": "Recommendation Systems",
  "area": "General"
}
{
  "name": "Mixed Attention Block",
  "full_name": "Mixed Attention Block",
  "description": "**Mixed Attention Block** is an attention module used in the [ConvBERT](https://paperswithcode.com/method/convbert) architecture. It is a mixture of [self-attention](https://paperswithcode.com/method/scaled) and [span-based dynamic convolution](https://paperswithcode.com/method/span-based-dynamic-convolution) (highlighted in pink). They share the same Query but use different Key to generate the attention map and [convolution](https://paperswithcode.com/method/convolution) kernel respectively. The number of attention heads is reducing by directly projecting the input to a smaller embedding space to form a bottleneck structure for self-attention and span-based dynamic convolution. Dimensions of the input and output of some blocks are labeled on the left top corner to illustrate the overall framework, where $d$ is the embedding size of the input and $\\gamma$ is the reduction ratio.",
  "title": "ConvBERT: Improving BERT with Span-based Dynamic Convolution",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "Low-level backbone",
  "full_name": "Low-level backbone",
  "description": "",
  "title": "EfficientPose: Scalable single-person pose estimation",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "GradDrop",
  "full_name": "Gradient Sign Dropout",
  "description": "**GradDrop**, or **Gradient Sign Dropout**, is a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. It is applied as a layer in any standard network forward pass, usually on the final layer before the prediction head to save on compute overhead and maximize benefits during backpropagation. Below, we develop the GradDrop formalism. Throughout, o denotes elementwise multiplication after any necessary tiling operations (if any) are completed.\r\nTo implement GradDrop, we first define the Gradient Positive Sign Purity, $\\mathcal{P}$, as\r\n\r\n$$\r\n\\mathcal{P}=\\frac{1}{2}\\left(1+\\frac{\\sum\\_{i} \\nabla L_\\{i}}{\\sum\\_{i}\\left|\\nabla L\\_{i}\\right|}\\right)\r\n$$\r\n\r\n$\\mathcal{P}$ is bounded by $[0,1] .$ For multiple gradient values $\\nabla\\_{a} L\\_{i}$ at some scalar $a$, we see that $\\mathcal{P}=0$ if $\\nabla_{a} L\\_{i}<0 $ $\\forall i$, while $\\mathcal{P}=1$ if $\\nabla\\_{a} L\\_{i}>0$ $\\forall i $. Thus, $\\mathcal{P}$ is a measure of how many positive gradients are present at any given value. We then form a mask for each gradient $\\mathcal{M}\\_{i}$ as follows:\r\n\r\n$$\r\n\\mathcal{M}\\_{i}=\\mathcal{I}[f(\\mathcal{P})>U] \\circ \\mathcal{I}\\left[\\nabla L\\_{i}>0\\right]+\\mathcal{I}[f(\\mathcal{P})<U] \\circ \\mathcal{I}\\left[\\nabla L\\_{i}<0\\right]\r\n$$\r\n\r\nfor $\\mathcal{I}$ the standard indicator function and $f$ some monotonically increasing function (often just the identity) that maps $[0,1] \\mapsto[0,1]$ and is odd around $(0.5,0.5)$. $U$ is a tensor composed of i.i.d $U(0,1)$ random variables. The $\\mathcal{M}\\_{i}$ is then used to produce a final gradient $\\sum \\mathcal{M}\\_{i} \\nabla L\\_{i}$",
  "title": "Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Performer",
  "full_name": "Performer",
  "description": "**Performer** is a [Transformer](https://paperswithcode.com/methods/category/transformers) architectures which can estimate regular ([softmax](https://paperswithcode.com/method/softmax)) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. To approximate softmax attention-kernels, Performers use a Fast Attention Via positive Orthogonal Random features approach (FAVOR+), leveraging new methods for approximating softmax and Gaussian kernels.",
  "title": "Rethinking Attention with Performers",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "EvoNorms",
  "full_name": "EvoNorms",
  "description": "**EvoNorms** are a set of normalization-activation layers that go beyond existing design patterns. Normalization and activation are unified into a single computation graph, its structure is evolved starting from low-level primitives. EvoNorms consist of two series: B series and S series. The B series are batch-dependent and were discovered by our method without any constraint. The S series work on individual samples, and were discovered by rejecting any batch-dependent operations.",
  "title": "Evolving Normalization-Activation Layers",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Inverse Square Root Schedule",
  "full_name": "Inverse Square Root Schedule",
  "description": "**Inverse Square Root** is a learning rate schedule 1 / $\\sqrt{\\max\\left(n, k\\right)}$ where\r\n$n$ is the current training iteration and $k$ is the number of warm-up steps. This sets a constant learning rate for the first $k$ steps, then exponentially decays the learning rate until pre-training is over.",
  "title": null,
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "Revision Network",
  "full_name": "Revision Network",
  "description": "**Revision Network** is a style transfer module that aims to revise the rough stylized image via generating residual details image $r_{c s}$, while the final stylized image is generated by combining $r\\_{c s}$ and rough stylized image $\\bar{x}\\_{c s}$. This procedure ensures that the distribution of global style pattern in $\\bar{x}\\_{c s}$ is properly kept. Meanwhile, learning to revise local style patterns with residual details image is easier for the Revision Network.\r\n\r\nAs shown in the Figure, the Revision Network is designed as a simple yet effective encoder-decoder architecture, with only one down-sampling and one up-sampling layer. Further, a [patch discriminator](https://paperswithcode.com/method/patchgan) is used to help Revision Network to capture fine patch textures under adversarial learning setting. The patch discriminator $D$ is defined following SinGAN, where $D$ owns 5 convolution layers and 32 hidden channels. A relatively shallow $D$ is chosen to (1) avoid overfitting since we only have one style image and (2) control the receptive field to ensure D can only capture local patterns.",
  "title": "Drafting and Revision: Laplacian Pyramid Network for Fast High-Quality Artistic Style Transfer",
  "collection": "Style Transfer Modules",
  "area": "Computer Vision"
}
{
  "name": "TABBIE",
  "full_name": "TABBIE",
  "description": "**TABBIE** is a pretraining objective (*corrupt cell detection*) that learns exclusively from tabular data. Unlike other approaches, TABBIE provides embeddings of all table substructures (cells, rows, and columns). TABBIE can be seen as a table embedding model trained to detect corrupted cells, inspired by the [ELECTRA](https://www.paperswithcode.com/method/electra) objective function.",
  "title": "TABBIE: Pretrained Representations of Tabular Data",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "PGM",
  "full_name": "Probability Guided Maxout",
  "description": "A regularization criterion that, differently from [dropout](https://paperswithcode.com/method/dropout) and its variants, is deterministic rather than random. It grounds on the empirical evidence that feature descriptors with larger L2-norm and highly-active nodes are strongly correlated to confident class predictions. Thus, the criterion guides towards dropping a percentage of the most active nodes of the descriptors, proportionally to the estimated class probability",
  "title": null,
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "RandSol",
  "full_name": "Randomized Adversarial Solarization",
  "description": "Attack on image classifiers by a image solarization through greedy random search.",
  "title": "Don't Look into the Sun: Adversarial Solarization Attacks on Image Classifiers",
  "collection": "Adversarial Attacks",
  "area": "General"
}
{
  "name": "Parametric UMAP",
  "full_name": "Parametric UMAP",
  "description": "**Parametric UMAP** is a non-parametric graph-based dimensionality reduction algorithm that extends the second step of [UMAP](https://www.paperswithcode.com/method/umap) to a parametric optimization over neural network weights, learning a parametric relationship between data and embedding.",
  "title": "Parametric UMAP embeddings for representation and semi-supervised learning",
  "collection": "Dimensionality Reduction",
  "area": "General"
}
{
  "name": "FCN",
  "full_name": "Fully Convolutional Network",
  "description": "**Fully Convolutional Networks**, or **FCNs**, are an architecture used mainly for semantic segmentation. They employ solely locally connected layers, such as [convolution](https://paperswithcode.com/method/convolution), pooling and upsampling. Avoiding the use of dense layers means less parameters (making the networks faster to train). It also means an FCN can work for variable image sizes given all connections are local.\r\n\r\nThe network consists of a downsampling path, used to extract and interpret the context, and an upsampling path, which allows for localization. \r\n\r\nFCNs also employ skip connections to recover the fine-grained spatial information lost in the downsampling path.",
  "title": "Fully Convolutional Networks for Semantic Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Pointer Sentinel-LSTM",
  "full_name": "Pointer Sentinel-LSTM",
  "description": "The **Pointer Sentinel-LSTM mixture model** is a type of recurrent neural network that combines the advantages of standard [softmax](https://paperswithcode.com/method/softmax) classifiers with those of a pointer component for effective and efficient language modeling. Rather than relying on the RNN hidden state to decide when to use the pointer, the model allows the pointer component itself to decide when to use the softmax vocabulary through a sentinel.",
  "title": "Pointer Sentinel Mixture Models",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Neural Cache",
  "full_name": "Neural Cache",
  "description": "A **Neural Cache**, or a **Continuous Cache**, is a module for language modelling which stores previous hidden states in memory cells. They are then used as keys to retrieve their corresponding word, that is the next word. There is no transformation applied to the storage during writing and reading.\r\n\r\nMore formally it exploits the hidden representations $h\\_{t}$ to define a probability distribution over the words in the cache. As\r\nillustrated in the Figure, the cache stores pairs $\\left(h\\_{i}, x\\_{i+1}\\right)$ of a hidden representation, and the word which was generated based on this representation (the vector $h\\_{i}$ encodes the history $x\\_{i}, \\dots, x\\_{1}$). At time $t$, we then define a probability distribution over words stored in the cache based on the stored hidden representations and the current one $h\\_{t}$ as:\r\n\r\n$$ p\\_{cache}\\left(w | h\\_{1\\dots{t}}, x\\_{1\\dots{t}}\\right) \\propto \\sum^{t-1}\\_{i=1}\\mathcal{1}\\_{\\text{set}\\left(w=x\\_{i+1}\\right)} \\exp\\left(θ\\_{h}>h\\_{t}^{T}h\\_{i}\\right) $$\r\n\r\nwhere the scalar $\\theta$ is a parameter which controls the flatness of the distribution. When $\\theta$ is equal to zero, the probability distribution over the history is uniform, and the model is equivalent to a unigram cache model.",
  "title": "Improving Neural Language Models with a Continuous Cache",
  "collection": "Language Model Components",
  "area": "Natural Language Processing"
}
{
  "name": "Compact Global Descriptor",
  "full_name": "Compact Global Descriptor",
  "description": "A **Compact Global Descriptor** is an image model block for modelling interactions between positions across different dimensions (e.g., channels, frames). This descriptor enables subsequent convolutions to access the informative global features. It is a form of attention.",
  "title": "Compact Global Descriptor for Neural Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "SentencePiece",
  "full_name": "SentencePiece",
  "description": "**SentencePiece** is a subword tokenizer and detokenizer for natural language processing. It performs subword segmentation, supporting the byte-pair-encoding ([BPE](https://paperswithcode.com/method/bpe)) algorithm and unigram language model, and then converts this text into an id sequence guarantee perfect reproducibility of the normalization and subword segmentation.",
  "title": "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing",
  "collection": "Tokenizers",
  "area": "Natural Language Processing"
}
{
  "name": "SKEP",
  "full_name": "SKEP",
  "description": "**SKEP** is a self-supervised pre-training method for sentiment analysis. With the help of automatically-mined knowledge, SKEP conducts sentiment masking and constructs three sentiment knowledge prediction objectives, so as to embed sentiment information at the word, polarity and aspect level into pre-trained sentiment representation. In particular, the prediction of aspect-sentiment pairs is converted into multi-label classification, aiming to capture the dependency between words in a pair.\r\n\r\nSKEP contains two parts: (1) Sentiment masking recognizes the sentiment information of an input sequence based on automatically-mined sentiment knowledge, and produces a corrupted version by removing these informations. (2) Sentiment pre-training objectives require the transformer to recover the removed information from the corrupted version. The three prediction objectives on top are jointly optimized: Sentiment Word (SW) prediction (on $\\left.\\mathrm{x}\\_{9}\\right)$, Word Polarity (SP) prediction (on $\\mathrm{x}\\_{6}$ and $\\mathbf{x}\\_{9}$ ), Aspect-Sentiment pairs (AP) prediction (on $\\mathbf{x}\\_{1}$ ). Here, the smiley denotes positive polarity. Notably, on $\\mathrm{x}\\_{6}$, only SP is calculated without SW, as its original word has been predicted in the pair prediction on $\\mathbf{x}\\_{1}$.",
  "title": "SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "ALDEN",
  "full_name": "ALDEN",
  "description": "**ALDEN**, or **Active Learning with DivErse iNterpretations**, is an active learning approach for text classification. With local interpretations in DNNs, ALDEN identifies linearly separable regions of samples. Then, it selects samples according to their diversity of local interpretations and queries their labels.\r\n\r\nSpecifically, we first calculate the local interpretations in DNN for each sample as the gradient backpropagated from the final\r\npredictions to the input features. Then, we use the most diverse interpretation of words in a sample to measure its diverseness. Accordingly, we select unlabeled samples with the maximally diverse interpretations for labeling and retrain the model with these\r\nlabeled samples.",
  "title": "Deep Active Learning for Text Classification with Diverse Interpretations",
  "collection": "Text Classification Models",
  "area": "Natural Language Processing"
}
{
  "name": "StruBERT",
  "full_name": "StruBERT: Structure-aware BERT for Table Search and Matching",
  "description": "A large amount of information is stored in data tables. Users can search for data tables using a keyword-based query. A table is composed primarily of data values that are organized in rows and columns providing implicit structural information. A table is usually accompanied by secondary information such as the caption, page title, etc., that form the textual information. Understanding the connection between the textual and structural information is an important yet neglected aspect in table retrieval as previous methods treat each source of information independently. In addition, users can search for data tables that are similar to an existing table, and this setting can be seen as a content-based table retrieval. In this paper, we propose StruBERT, a structure-aware BERT model that fuses the textual and structural information of a data table to produce context-aware representations for both textual and tabular content of a data table. StruBERT features are integrated in a new end-to-end neural ranking model to solve three table-related downstream tasks: keyword- and content-based table retrieval, and table similarity. We evaluate our approach using three datasets, and we demonstrate substantial improvements in terms of retrieval and classification metrics over state-of-the-art methods.",
  "title": "StruBERT: Structure-aware BERT for Table Search and Matching",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "Skip-gram Word2Vec",
  "full_name": "Skip-gram Word2Vec",
  "description": "**Skip-gram Word2Vec** is an architecture for computing word embeddings. Instead of using surrounding words to predict the center word, as with CBow Word2Vec, Skip-gram Word2Vec uses the central word to predict the surrounding words.\r\n\r\nThe skip-gram objective function sums the log probabilities of the surrounding $n$ words to the left and right of the target word $w\\_{t}$ to produce the following objective:\r\n\r\n$$J\\_\\theta = \\frac{1}{T}\\sum^{T}\\_{t=1}\\sum\\_{-n\\leq{j}\\leq{n}, \\neq{0}}\\log{p}\\left(w\\_{j+1}\\mid{w\\_{t}}\\right)$$",
  "title": "Efficient Estimation of Word Representations in Vector Space",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "Big-Little Net",
  "full_name": "Big-Little Net",
  "description": "**Big-Little Net** is a convolutional neural network architecture for learning multi-scale feature representations. This is achieved by using a multi-branch network, which has different computational complexity at different branches with different resolutions. Through frequent merging of features from branches at distinct scales, the model obtains multi-scale features while using less computation.\r\n\r\nIt consists of Big-Little Modules, which have two branches: each of which represents a separate block from a deep model and a less deep counterpart. The two branches are fused with linear combination + unit weights. These two branches are known as Big-Branch (more layers and channels at low resolutions) and Little-Branch (fewer layers and channels at high resolution).",
  "title": "Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "AffCorrs",
  "full_name": "Affordance Correspondence",
  "description": "Method for one-shot visual search of object parts / one-shot semantic part correspondence. Given a single reference image of an object with annotated affordance regions, it segments semantically corresponding parts within a target scene. AffCorrs is used to find corresponding affordances both for intra- and inter-class one-shot part segmentation.",
  "title": null,
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Multiscale Dilated Convolution Block",
  "full_name": "Multiscale Dilated Convolution Block",
  "description": "A **Multiscale Dilated Convolution Block** is an Inception-style convolutional block motivated by the ideas that image features naturally occur at multiple scales, that a network’s expressivity is proportional to the range of functions it can represent divided by its total number of parameters, and by the desire to efficiently expand a network’s receptive field. The Multiscale [Dilated Convolution](https://paperswithcode.com/method/dilated-convolution) (MDC) block applies a single $F\\times{F}$ filter at multiple dilation factors, then performs a weighted elementwise sum of each dilated filter’s output, allowing the network to simultaneously learn a set of features and the relevant scales at which those features occur with a minimal increase in parameters. This also rapidly expands the network’s receptive field without requiring an increase in depth or the number of parameters.",
  "title": "Neural Photo Editing with Introspective Adversarial Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Seq2Seq",
  "full_name": "Sequence to Sequence",
  "description": "**Seq2Seq**, or **Sequence To Sequence**, is a model used in sequence prediction tasks, such as language modelling and machine translation. The idea is to use one [LSTM](https://paperswithcode.com/method/lstm), the *encoder*, to read the input sequence one timestep at a time, to obtain a large fixed dimensional vector representation (a context vector), and then to use another LSTM, the *decoder*, to extract the output sequence\r\nfrom that vector. The second LSTM is essentially a recurrent neural network language model except that it is conditioned on the input sequence.\r\n\r\n(Note that this page refers to the original seq2seq not general sequence-to-sequence models)",
  "title": "Sequence to Sequence Learning with Neural Networks",
  "collection": "Sequence To Sequence Models",
  "area": "Sequential"
}
{
  "name": "Channel Squeeze and Spatial Excitation",
  "full_name": "Channel Squeeze and Spatial Excitation (sSE)",
  "description": "Inspired on the widely known [spatial squeeze and channel excitation (SE)](https://paperswithcode.com/method/squeeze-and-excitation-block) block, the sSE block performs channel squeeze and spatial excitation, to recalibrate the feature maps spatially and achieve more fine-grained image segmentation.",
  "title": "Recalibrating Fully Convolutional Networks with Spatial and Channel 'Squeeze & Excitation' Blocks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "RoIWarp",
  "full_name": "RoIWarp",
  "description": "**Region of Interest Warping**, or **RoIWarp**, is a form of [RoIPool](https://paperswithcode.com/method/roi-pooling) that is differentiable with respect to the box position. In practice, this takes the form of a RoIWarp layer followed by a standard [Max Pooling](https://paperswithcode.com/method/max-pooling) layer. The RoIWarp layer crops a feature map region and warps it into a target size by interpolation.",
  "title": "Instance-aware Semantic Segmentation via Multi-task Network Cascades",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "ZeRO-Offload",
  "full_name": "ZeRO-Offload",
  "description": "ZeRO-Offload is a sharded data parallel method for distributed training. It exploits both CPU memory and compute for offloading, while offering a clear path towards efficiently scaling on multiple GPUs by working with [ZeRO-powered data parallelism](https://www.paperswithcode.com/method/zero). The symbiosis allows ZeRO-Offload to maintain a single copy of the optimizer states on the CPU memory regardless of the data parallel degree. Furthermore, it keeps the aggregate communication volume between GPU and CPU, as well as the aggregate CPU computation a constant regardless of data parallelism, allowing ZeRO-Offload to effectively utilize the linear increase in CPU compute with the increase in the data parallelism degree.",
  "title": "ZeRO-Offload: Democratizing Billion-Scale Model Training",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "ReLU",
  "full_name": "Rectified Linear Units",
  "description": "**Rectified Linear Units**, or **ReLUs**, are a type of activation function that are linear in the positive dimension, but zero in the negative dimension. The kink in the function is the source of the non-linearity. Linearity in the positive dimension has the attractive property that it prevents non-saturation of gradients (contrast with [sigmoid activations](https://paperswithcode.com/method/sigmoid-activation)), although for half of the real line its gradient is zero.\r\n\r\n$$ f\\left(x\\right) = \\max\\left(0, x\\right) $$",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "InstaBoost",
  "full_name": "InstaBoost",
  "description": "**InstaBoost** is a data augmentation technique for instance segmentation that utilises existing instance mask annotations.\r\n\r\nIntuitively in a small neighbor area of $(x_0, y_0, 1, 0)$, the probability map $P(x, y, s, r)$ should be high-valued since images are usually continuous and redundant in pixel level. Based on this, InstaBoost is a form of augmentation where we apply object jittering that randomly samples transformation tuples from the neighboring space of identity transform $(x_0, y_0, 1, 0)$ and paste the cropped object following affine transform $\\mathbf{H}$.",
  "title": "InstaBoost: Boosting Instance Segmentation via Probability Map Guided Copy-Pasting",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "LAPGAN",
  "full_name": "LAPGAN",
  "description": "A **LAPGAN**, or **Laplacian Generative Adversarial Network**, is a type of generative adversarial network that has a [Laplacian pyramid](https://paperswithcode.com/method/laplacian-pyramid) representation. In the sampling procedure following training, we have a set of generative convnet models {$G\\_{0}, \\dots , G\\_{K}$}, each of which captures the distribution of coefficients $h\\_{k}$ for natural images at a different level of the Laplacian pyramid. Sampling an image is akin to a reconstruction procedure, except that the generative\r\nmodels are used to produce the $h\\_{k}$’s:\r\n\r\n$$ \\tilde{I}\\_{k} = u\\left(\\tilde{I}\\_{k+1}\\right) + \\tilde{h}\\_{k} = u\\left(\\tilde{I}\\_{k+1}\\right) + G\\_{k}\\left(z\\_{k}, u\\left(\\tilde{I}\\_{k+1}\\right)\\right)$$\r\n\r\nThe recurrence starts by setting $\\tilde{I}\\_{K+1} = 0$ and using the model at the final level $G\\_{K}$ to generate a residual image $\\tilde{I}\\_{K}$ using noise vector $z\\_{K}$: $\\tilde{I}\\_{K} = G\\_{K}\\left(z\\_{K}\\right)$. Models at all levels except the final are conditional generative models that take an upsampled version of the current image $\\tilde{I}\\_{k+1}$ as a conditioning variable, in addition to the noise vector $z\\_{k}$.\r\n\r\nThe generative models {$G\\_{0}, \\dots, G\\_{K}$} are trained using the CGAN approach at each level of the pyramid. Specifically, we construct a Laplacian pyramid from each training image $I$. At each level we make a stochastic choice (with equal probability) to either (i) construct the coefficients $h\\_{k}$ either using the standard Laplacian pyramid coefficient generation procedure or (ii) generate them using $G\\_{k}:\r\n\r\n$$ \\tilde{h}\\_{k} = G\\_{k}\\left(z\\_{k}, u\\left(I\\_{k+1}\\right)\\right) $$\r\n\r\nHere $G\\_{k}$ is a convnet which uses a coarse scale version of the image $l\\_{k} = u\\left(I\\_{k+1}\\right)$ as an input, as well as noise vector $z\\_{k}$. 
$D\\_{k}$ takes as input $h\\_{k}$ or $\\tilde{h}\\_{k}$, along with the low-pass image $l\\_{k}$ (which is explicitly added to $h\\_{k}$ or $\\tilde{h}\\_{k}$ before the first [convolution](https://paperswithcode.com/method/convolution) layer), and predicts if the image was real or\r\ngenerated. At the final scale of the pyramid, the low frequency residual is sufficiently small that it\r\ncan be directly modeled with a standard [GAN](https://paperswithcode.com/method/gan): $\\tilde{h}\\_{K} = G\\_{K}\\left(z\\_{K}\\right)$ and $D\\_{K}$ only has $h\\_{K}$ or $\\tilde{h}\\_{K}$ as input.\r\n\r\nBreaking the generation into successive refinements is the key idea. We give up any “global” notion of fidelity; an attempt is never made to train a network to discriminate between the output of a cascade and a real image and instead the focus is on making each step plausible.",
  "title": "Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "VOS",
  "full_name": "VOS",
  "description": "**VOS** is a type of video object segmentation model consisting of two network components. The target appearance model consists of a light-weight module, which is learned during the inference stage using fast optimization techniques to predict a coarse but robust target segmentation. The segmentation model is exclusively trained offline, designed to process the coarse scores into high quality segmentation masks.",
  "title": "Learning Fast and Robust Target Models for Video Object Segmentation",
  "collection": "Video Object Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "T-Fixup",
  "full_name": "T-Fixup",
  "description": "**T-Fixup** is an [initialization](https://paperswithcode.com/methods/category/initialization) method for [Transformers](https://paperswithcode.com/methods/category/transformers) that aims to remove the need for [layer normalization](https://paperswithcode.com/method/layer-normalization) and [warmup](https://paperswithcode.com/method/linear-warmup). The initialization procedure is as follows:\r\n\r\n- Apply [Xavier initialization](https://paperswithcode.com/method/xavier-initialization) for all parameters excluding input embeddings. Use Gaussian initialization $\\mathcal{N}\\left(0, d^{-\\frac{1}{2}}\\right)$ for input embeddings where $d$ is the embedding dimension.\r\n- Scale $\\mathbf{v}\\_{d}$ and $\\mathbf{w}\\_{d}$ matrices in each decoder [attention block](https://paperswithcode.com/method/multi-head-attention), weight matrices in each decoder [MLP block](https://paperswithcode.com/method/position-wise-feed-forward-layer) and input embeddings $\\mathbf{x}$ and $\\mathbf{y}$ in encoder and decoder by $(9 N)^{-\\frac{1}{4}}$\r\n- Scale $\\mathbf{v}\\_{e}$ and $\\mathbf{w}\\_{e}$ matrices in each encoder [attention block](https://paperswithcode.com/method/multi-head-attention) and weight matrices in each encoder [MLP block](https://paperswithcode.com/method/position-wise-feed-forward-layer) by $0.67 N^{-\\frac{1}{4}}$",
  "title": "Improving Transformer Optimization Through Better Initialization",
  "collection": "Initialization",
  "area": "General"
}
{
  "name": "CVRL",
  "full_name": "Contrastive Video Representation Learning",
  "description": "**Contrastive Video Representation Learning**, or **CVRL**, is a self-supervised contrastive learning framework for learning spatiotemporal visual representations from unlabeled videos. Representations are learned using a contrastive loss, where two clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. Data augmentations are designed involving spatial and temporal cues. Concretely, a [temporally consistent spatial augmentation](https://paperswithcode.com/method/temporally-consistent-spatial-augmentation#) method is used to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. A sampling-based temporal augmentation method is also used to avoid overly enforcing invariance on clips that are distant in time. \r\n\r\nEnd-to-end, from a raw video, we first sample a temporal interval from a monotonically decreasing distribution. The temporal interval represents the number of frames between the start points of two clips, and we sample two clips from a video according to this interval. Afterwards we apply a [temporally consistent spatial augmentation](https://paperswithcode.com/method/temporally-consistent-spatial-augmentation) to each of the clips and feed them into a 3D backbone with an MLP head. The contrastive loss is used to train the network to attract the clips from the same video and repel the clips from different videos in the embedding space.",
  "title": "Spatiotemporal Contrastive Video Representation Learning",
  "collection": "Generative Video Models",
  "area": "Computer Vision"
}
{
  "name": "Global Local Attention Module",
  "full_name": "Global Local Attention Module",
  "description": "The Global Local Attention Module (GLAM) is an image model block that attends to the feature map's channels and spatial dimensions locally, and also attends to the feature map's channels and spatial dimensions globally. The locally attended feature maps, globally attended feature maps, and the original feature maps are then fused through a weighted sum (with learnable weights) to obtain the final feature map.\r\n\r\nPaper:\r\n\r\nSong, C. H., Han, H. J., & Avrithis, Y. (2022). All the attention you need: Global-local, spatial-channel attention for image retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2754-2763).",
  "title": "All the attention you need: Global-local, spatial-channel attention for image retrieval",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Triplet Entropy Loss",
  "full_name": "Triplet Entropy Loss",
  "description": "The Triplet Entropy Loss (TEL) training method aims to leverage both the strengths of Cross Entropy Loss (CEL) and [Triplet loss](https://paperswithcode.com/method/triplet-loss) during the training process, assuming that it would lead to better generalization. The TEL method though does not contain a pre-training step, but trains simultaneously with both CEL and Triplet losses.",
  "title": "Triplet Entropy Loss: Improving The Generalisation of Short Speech Language Identification Systems",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "PULSE",
  "full_name": "PULSE",
  "description": "**PULSE** is a self-supervised photo upsampling algorithm. Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the downscaling loss, which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, the authors aim to restrict the search space to guarantee realistic outputs.",
  "title": "PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models",
  "collection": "Image Super-Resolution Models",
  "area": "Computer Vision"
}
{
  "name": "SPNet",
  "full_name": "Strip Pooling Network",
  "description": "Spatial pooling usually operates on a small region which limits its capability to capture long-range dependencies and focus on distant regions. To overcome this, Hou et al. proposed  strip pooling, a novel pooling method capable of encoding long-range context in either horizontal or vertical spatial domains.  \r\n\r\nStrip pooling has two branches for horizontal  and vertical strip pooling. The horizontal strip pooling part first pools the input feature $F \\in \\mathcal{R}^{C \\times H \\times W}$ in the horizontal direction:\r\n\\begin{align}\r\ny^1 = \\text{GAP}^w (X) \r\n\\end{align}\r\nThen a 1D convolution with kernel size 3 is applied in $y$ to capture the relationship between different rows and channels. This is repeated $W$ times to make  the output $y_v$  consistent with the input shape:\r\n\\begin{align}\r\n    y_h = \\text{Expand}(\\text{Conv1D}(y^1))\r\n\\end{align}\r\nVertical strip pooling is performed in a similar way. Finally, the outputs of the two branches are fused using element-wise summation to produce the attention map:\r\n\\begin{align}\r\ns &= \\sigma(Conv^{1\\times 1}(y_{v} + y_{h}))\r\n\\end{align}\r\n\\begin{align}\r\nY &= s  X\r\n\\end{align}\r\n\r\nThe strip pooling module (SPM) is further developed in the mixed pooling module (MPM). Both consider  spatial  and channel relationships to overcome the locality of convolutional neural networks.  SPNet achieves  state-of-the-art results for several complex semantic segmentation benchmarks.",
  "title": "Strip Pooling: Rethinking Spatial Pooling for Scene Parsing",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "OSCAR",
  "full_name": "OSCAR",
  "description": "OSCAR is a new learning method that uses object tags detected in images as anchor points to ease the learning of image-text alignment. The model take a triple as input (word-tag-region) and pre-trained with two losses (masked token loss over words and tags, and a contrastive loss between tags and others). OSCAR represents an image-text pair into semantic space via dictionary lookup. Object tags are used as anchor points to align image regions with word embeddings of pre-trained language models. The model is then fine-tuned for understanding and generation tasks.",
  "title": "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "MFEC",
  "full_name": "Model-Free Episodic Control",
  "description": "Non-parametric approximation of Q-values by storing all visited states and doing inference through k-Nearest Neighbors.",
  "title": "Model-Free Episodic Control",
  "collection": "Non-Parametric Regression",
  "area": "General"
}
{
  "name": "GRLIA",
  "full_name": "GRLIA",
  "description": "**GRLIA** is an incident aggregation framework for online service systems based on graph representation learning over the cascading graph of cloud failures. A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations among incidents.",
  "title": "Graph-based Incident Aggregation for Large-Scale Online Service Systems",
  "collection": "Incident Aggregation Models",
  "area": "General"
}
{
  "name": "Computation Redistribution",
  "full_name": "Computation Redistribution",
  "description": "**Computation Redistribution** is an [neural architecture search](https://paperswithcode.com/task/architecture-search) method for [face detection](https://paperswithcode.com/task/face-detection), which reallocates the computation between the backbone, neck and head of the model based on a predefined search methodology. Directly utilising the backbone of a classification network for scale-specific face detection can be sub-optimal. Therefore, [network structure search](https://paperswithcode.com/method/regnety) is used to reallocate the computation on the backbone, neck and head, under a wide range of flop regimes. The search method is applied to [RetinaNet](https://paperswithcode.com/method/retinanet), with [ResNet](https://paperswithcode.com/method/resnet) as backbone, [Path Aggregation Feature Pyramid Network](https://paperswithcode.com/method/pafpn) (PAFPN)  as the neck and stacked 3 × 3 [convolutional layers](https://paperswithcode.com/method/convolution) for the head. While the general structure is simple, the total number of possible networks in the search space is unwieldy. In the first step, the authors explore the reallocation of the computation within the backbone parts (i.e. stem, C2, C3, C4, and C5), while fixing the neck and head components. Based on the optimised computation distribution on the backbone they find, they further explore the reallocation of the computation across the backbone, neck and head.",
  "title": "Sample and Computation Redistribution for Efficient Face Detection",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "Tacotron",
  "full_name": "Tacotron",
  "description": "**Tacotron** is an end-to-end generative text-to-speech model that takes a character sequence as input and outputs the corresponding spectrogram. The backbone of Tacotron is a seq2seq model with attention. The Figure depicts the model, which includes an encoder, an attention-based decoder, and a post-processing net. At a high-level, the model takes characters as input and produces spectrogram\r\nframes, which are then converted to waveforms.",
  "title": "Tacotron: Towards End-to-End Speech Synthesis",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "VATT",
  "full_name": "VATT",
  "description": "**Video-Audio-Text Transformer**, or **VATT**, is a framework for learning multimodal representations from unlabeled data using [convolution](https://paperswithcode.com/method/convolution)-free [Transformer](https://paperswithcode.com/method/transformer) architectures. Specifically, it takes raw signals as inputs and extracts multidimensional representations that are rich enough to benefit a variety of downstream tasks. VATT borrows the exact architecture from [BERT](https://paperswithcode.com/method/bert) and [ViT](https://paperswithcode.com/method/vision-transformer) except the layer of tokenization and linear projection reserved for each modality separately. The design follows the same spirit as ViT that makes the minimal changes to the architecture so that the learned model can transfer its weights to various frameworks and tasks.\r\n\r\nVATT linearly projects each modality into a feature vector and feeds it into a Transformer encoder. A semantically hierarchical common space is defined to account for the granularity of different modalities and noise contrastive estimation is employed to train the model.",
  "title": "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Location-based Attention",
  "full_name": "Location-based Attention",
  "description": "**Location-based Attention** is an attention mechanism in which the alignment scores are computed from solely the target hidden state $\\mathbf{h}\\_{t}$ as follows:\r\n\r\n$$ \\mathbf{a}\\_{t} = \\text{softmax}(\\mathbf{W}\\_{a}\\mathbf{h}_{t}) $$",
  "title": "Effective Approaches to Attention-based Neural Machine Translation",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "MSPFN",
  "full_name": "Multi-scale Progressive Fusion Network",
  "description": "**Multi-scale Progressive Fusion Network** (MSFPN) is a neural network representation for single image deraining. It aims to exploit the correlated information of rain streaks across scales for single image deraining. \r\n\r\nSpecifically, we first generate the Gaussian pyramid rain images using Gaussian kernels to down-sample the original rain image in sequence. A coarse-fusion module (CFM) is designed to capture the global texture information from these multi-scale rain images through recurrent calculation (Conv-[LSTM](https://paperswithcode.com/method/lstm)), thus enabling the network to cooperatively represent the target rain streak using similar counterparts from global feature space. Meanwhile, the representation of the high-resolution pyramid layer is guided by previous outputs as well as all low-resolution pyramid layers. A finefusion module (FFM) is followed to further integrate these correlated information from different scales. By using the channel attention mechanism, the network not only discriminatively learns the scale-specific knowledge from all preceding pyramid layers, but also reduces the feature redundancy effectively. Moreover, multiple FFMs can be cascaded to form a progressive multi-scale fusion. Finally, a reconstruction module (RM) is appended to aggregate the coarse and fine rain information extracted respectively from CFM and FFM for learning the residual rain image, which is the approximation of real rain streak distribution.",
  "title": "Multi-Scale Progressive Fusion Network for Single Image Deraining",
  "collection": "Deraining Models",
  "area": "Computer Vision"
}
{
  "name": "BLANC",
  "full_name": "BLANC",
  "description": "**BLANC** is an automatic estimation approach for document summary quality. The goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. BLANC achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document's text.",
  "title": "Fill in the BLANC: Human-free quality estimation of document summaries",
  "collection": "Document Summary Evaluation",
  "area": "Natural Language Processing"
}
{
  "name": "PAFs",
  "full_name": "Part Affinity Fields",
  "description": "",
  "title": "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields",
  "collection": "Output Functions",
  "area": "General"
}
{
  "name": "Motion Disentanglement",
  "full_name": "Self-Supervised Motion Disentanglement",
  "description": "A self-supervised learning method to disentangle irregular (anomalous) motion from regular motion in unlabeled videos.",
  "title": "Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment",
  "collection": "Action Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "DBGAN",
  "full_name": "Distribution-induced Bidirectional Generative Adversarial Network for Graph Representation Learning",
  "description": "DBGAN is a method for graph representation learning. Instead of the widely used normal distribution assumption, the prior distribution of latent representation in DBGAN is estimated in a structure-aware way, which implicitly bridges the graph and feature spaces by prototype learning.\r\n\r\nSource: [Distribution-induced Bidirectional Generative Adversarial Network for Graph Representation Learning](https://arxiv.org/abs/1912.01899)",
  "title": null,
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Weight Standardization",
  "full_name": "Weight Standardization",
  "description": "**Weight Standardization** is a normalization technique that smooths the loss landscape by standardizing the weights in convolutional layers. Different from the previous normalization methods that focus on *activations*, WS considers the smoothing effects of *weights* more than just length-direction decoupling. Theoretically, WS reduces the Lipschitz constants of the loss and the gradients.\r\nHence, WS smooths the loss landscape and improves training.\r\n\r\nIn Weight Standardization, instead of directly optimizing the loss $\\mathcal{L}$ on the original weights $\\hat{W}$, we reparameterize the weights $\\hat{W}$ as a function of $W$, i.e. $\\hat{W}=\\text{WS}(W)$, and optimize the loss $\\mathcal{L}$ on $W$ by [SGD](https://paperswithcode.com/method/sgd):\r\n\r\n$$\r\n    \\hat{W} = \\Big[ \\hat{W}\\_{i,j}~\\big|~ \\hat{W}\\_{i,j} = \\dfrac{W\\_{i,j} - \\mu\\_{W\\_{i,\\cdot}}}{\\sigma\\_{W\\_{i,\\cdot}+\\epsilon}}\\Big]\r\n$$\r\n\r\n$$\r\n    y = \\hat{W}*x\r\n$$\r\n\r\nwhere\r\n\r\n$$\r\n    \\mu_{W\\_{i,\\cdot}} = \\dfrac{1}{I}\\sum\\_{j=1}^{I}W\\_{i, j},~~\\sigma\\_{W\\_{i,\\cdot}}=\\sqrt{\\dfrac{1}{I}\\sum\\_{i=1}^I(W\\_{i,j} - \\mu\\_{W\\_{i,\\cdot}})^2}\r\n$$\r\n\r\nSimilar to [Batch Normalization](https://paperswithcode.com/method/batch-normalization), WS controls the first and second moments of the weights of each output channel individually in convolutional layers. Note that many initialization methods also initialize the weights in some similar ways. Different from those methods, WS standardizes the weights in a differentiable way which aims to normalize gradients during back-propagation. Note that we do not have any affine transformation on $\\hat{W}$. This is because we assume that normalization layers such as BN or [GN](https://paperswithcode.com/method/group-normalization) will normalize this convolutional layer again.",
  "title": "Micro-Batch Training with Batch-Channel Normalization and Weight Standardization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "HRI pipeline",
  "full_name": "Human Robot Interaction Pipeline",
  "description": "The pipeline we propose consists of three parts: 1) recognizing the interaction type; 2) detecting the object that the interaction is targeting; and 3) learning incrementally the models from data recorded by the robot sensors. Our main contributions lie in the target object detection, guided by the recognized interaction, and in the incremental object learning. The novelty of our approach is the focus on natural, heterogeneous, and multimodal HRIs to incrementally learn new object models.",
  "title": null,
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "SoftPool",
  "full_name": "Soft Pooling",
  "description": "SoftPool: a fast and efficient method that sums exponentially weighted activations. Compared to a range of other pooling methods, SoftPool retains more information in the downsampled activation maps. More refined downsampling leads to better classification accuracy.",
  "title": "Refining activation downsampling with SoftPool",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "NAS-FCOS",
  "full_name": "NAS-FCOS",
  "description": "**NAS-FCOS** consists of two sub networks, an [FPN](https://paperswithcode.com/method/fpn) $f$ and a set of prediction heads $h$ which have shared structures. One notable difference with other FPN-based one-stage detectors is that our heads have partially shared weights. Only the last several layers of the predictions heads (marked as yellow) are tied by their weights. The number of layers to share is decided automatically by the search algorithm. Note that both FPN and head are in our actual search space; and have more layers than shown in this figure.",
  "title": "NAS-FCOS: Fast Neural Architecture Search for Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "CenterPoint",
  "full_name": "CenterPoint",
  "description": "**CenterPoint** is a two-stage 3D detector that finds centers of objects and their properties using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation and velocity. In a second-stage, it refines these estimates using additional point features on the object. CenterPoint uses a standard Lidar-based backbone network, i.e., VoxelNet or PointPillars, to build a representation of the input point-cloud. CenterPoint predicts the relative offset (velocity) of objects between consecutive frames, which are then linked up greedily -- so in Centerpoint, 3D object tracking simplifies to greedy closest-point matching.",
  "title": "Center-based 3D Object Detection and Tracking",
  "collection": "3D Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "YOLOv1",
  "full_name": "YOLOv1",
  "description": "**YOLOv1** is a single-stage object detection model. Object detection is framed as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. \r\n\r\nThe network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means the network reasons globally about the full image and all the objects in the image.",
  "title": "You Only Look Once: Unified, Real-Time Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "GPT",
  "full_name": "GPT",
  "description": "**GPT** is a [Transformer](https://paperswithcode.com/method/transformer)-based architecture and training procedure for natural language processing tasks. Training follows a two-stage procedure. First, a language modeling objective is used on\r\nthe unlabeled data to learn the initial parameters of a neural network model. Subsequently, these parameters are adapted to a target task using the corresponding supervised objective.",
  "title": "Improving Language Understanding by Generative Pre-Training",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Spatial Transformer",
  "full_name": "Spatial Transformer",
  "description": "A **Spatial Transformer** is an image model block that explicitly allows the spatial manipulation of data within a [convolutional neural network](https://paperswithcode.com/methods/category/convolutional-neural-networks). It gives CNNs the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. Unlike pooling layers, where the receptive fields are fixed and local, the spatial transformer module is a dynamic mechanism that can actively spatially transform an image (or a feature map) by producing an appropriate transformation for each input sample. The transformation is then performed on the entire feature map (non-locally) and can include scaling, cropping, rotations, as well as non-rigid deformations.\r\n\r\nThe architecture is shown in the Figure to the right. The input feature map $U$ is passed to a localisation network which regresses the transformation parameters $\\theta$. The regular spatial grid $G$ over $V$ is transformed to the sampling grid $T\\_{\\theta}\\left(G\\right)$, which is applied to $U$, producing the warped output feature map $V$. The combination of the localisation network and sampling mechanism defines a spatial transformer.",
  "title": "Spatial Transformer Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "TABPFN",
  "full_name": "tabular data Prior-data Fitted Network",
  "description": "We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods. TabPFN is fully entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. This prior incorporates ideas from causal reasoning: It entails a large space of structural causal models with a preference for simple structures. On the 18 datasets in the OpenML-CC18 suite that contain up to 1 000 training data points, up to 100 purely numerical features without missing values, and up to 10 classes, we show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 230× speedup. This increases to a 5 700× speedup when using a GPU. We also validate these results on an additional 67 small numerical datasets from OpenML. We provide all our code, the trained TabPFN, an interactive browser demo and a Colab notebook at https://github.com/automl/TabPFN.",
  "title": "TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "WaveGrad",
  "full_name": "WaveGrad",
  "description": "**WaveGrad** is a conditional model for waveform generation through estimating gradients of the data density. This model is built on the prior work on score matching and diffusion probabilistic models. It starts from Gaussian white noise and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad is non-autoregressive, and requires only a constant number of generation steps during inference. It can use as few as 6 iterations to generate high fidelity audio samples.",
  "title": "WaveGrad: Estimating Gradients for Waveform Generation",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "BiGG",
  "full_name": "BiGG",
  "description": "**BiGG** is an autoregressive model for generative modeling for sparse graphs. It utilizes sparsity to avoid generating the full adjacency matrix, and reduces the graph generation time complexity to $O(((n + m)\\log n)$. Furthermore, during training this autoregressive model can be parallelized with $O(\\log n)$ synchronization stages, which makes it much more efficient than other autoregressive models that require $\\Omega(n)$. The approach is based on three key elements: (1) an $O(\\log n)$ process for generating each edge using a binary tree data structure, inspired by R-MAT; (2) a tree-structured autoregressive model for generating the set of edges associated with each node; and (3) an autoregressive model defined over the sequence of nodes.",
  "title": "Scalable Deep Generative Modeling for Sparse Graphs",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "ConvNeXt",
  "full_name": "ConvNeXt",
  "description": "",
  "title": "A ConvNet for the 2020s",
  "collection": "Backbone Architectures",
  "area": "Computer Vision"
}
{
  "name": "wav2vec-U",
  "full_name": "wav2vec Unsupervised",
  "description": "**wav2vec-U** is an unsupervised method to train speech recognition models without any labeled data. It leverages self-supervised speech representations to segment unlabeled language and learn a mapping from these representations to phonemes via adversarial training. \r\n\r\nSpecifically, we learn self-supervised representations with wav2vec 2.0 on unlabeled speech audio, then identify clusters in the representations with k-means to segment the audio data. Next, we build segment representations by mean pooling the wav2vec 2.0 representations, performing [PCA](https://paperswithcode.com/method/pca) and a second mean pooling step between adjacent segments. This is input to the generator which outputs a phoneme sequence that is fed to the discriminator, similar to phonemized unlabeled text to perform adversarial training.",
  "title": "Unsupervised Speech Recognition",
  "collection": "Speech Recognition",
  "area": "Audio"
}
{
  "name": "MacBERT",
  "full_name": "MacBERT",
  "description": "**MacBERT** is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based model for Chinese NLP that alters [RoBERTa](https://paperswithcode.com/method/roberta) in several ways, including a modified masking strategy. Instead of masking with [MASK] token, which never appears in the fine-tuning stage, MacBERT masks the word with its similar word. Specifically MacBERT shares the same pre-training tasks as [BERT](https://paperswithcode.com/method/bert) with several modifications. For the MLM task, the following modifications are performed:\r\n\r\n- Whole word masking is used as well as Ngram masking strategies for selecting candidate tokens for masking, with a percentage of\r\n40%, 30%, 20%, 10% for word-level unigram to 4-gram.\r\n- Instead of masking with [MASK] token, which never appears in the fine-tuning stage, similar words are used for the masking purpose. A similar word is obtained by using Synonyms toolkit which is based on word2vec similarity calculations. If an N-gram is selected to mask, we will find similar words individually. In rare cases, when there is no similar word, we will degrade to use random word replacement.\r\n- A percentage of 15% input words is used for masking, where 80% will replace with similar words, 10% replace with a random word, and keep with original words for the rest of 10%.",
  "title": "Revisiting Pre-Trained Models for Chinese Natural Language Processing",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Natural Gradient Descent",
  "full_name": "Natural Gradient Descent",
  "description": "**Natural Gradient Descent** is an approximate second-order optimisation method. It has an interpretation as optimizing over a Riemannian manifold using an intrinsic distance metric, which implies the updates are invariant to transformations such as whitening. By using the positive semi-definite (PSD) Gauss-Newton matrix to approximate the (possibly negative definite) Hessian, NGD can often work better than exact second-order methods.\r\n\r\nGiven the gradient of $z$, $g = \\frac{\\delta{f}\\left(z\\right)}{\\delta{z}}$, NGD computes the update as:\r\n\r\n$$\\Delta{z} = \\alpha{F}^{−1}g$$\r\n\r\nwhere the Fisher information matrix $F$ is defined as:\r\n\r\n$$ F = \\mathbb{E}\\_{p\\left(t\\mid{z}\\right)}\\left[\\nabla\\ln{p}\\left(t\\mid{z}\\right)\\nabla\\ln{p}\\left(t\\mid{z}\\right)^{T}\\right] $$\r\n\r\nThe log-likelihood function $\\ln{p}\\left(t\\mid{z}\\right)$ typically corresponds to commonly used error functions such as the cross entropy loss.\r\n\r\nSource: [LOGAN](https://paperswithcode.com/method/logan)\r\n\r\nImage: [Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks\r\n](https://arxiv.org/abs/1905.10961)",
  "title": null,
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "NPID++",
  "full_name": "NPID++",
  "description": "**NPID++** (Non-Parametric Instance Discrimination) is a self-supervision approach that takes a non-parametric classification approach. It approves upon [NPID](https://paperswithcode.com/method/npid) by using more negative samples and training for more epochs.",
  "title": "Self-Supervised Learning of Pretext-Invariant Representations",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "FuseFormer Block",
  "full_name": "FuseFormer Block",
  "description": "A **FuseFormer block** is used in the [FuseFormer](https://paperswithcode.com/method/fuseformer) model for video inpainting. It is the same to standard [Transformer](https://paperswithcode.com/method/transformer) block except that feed forward network is replaced with a Fusion Feed Forward Network (F3N). F3N brings no extra parameter into the standard feed forward net and the difference is that F3N inserts a soft-split and a soft composite operation between the two layer of MLPs.",
  "title": "FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting",
  "collection": "Video Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "VAE",
  "full_name": "Variational Autoencoder",
  "description": "A **Variational Autoencoder** is a type of likelihood-based generative model. It consists of an encoder, that takes in data $x$ as input and transforms this into a latent representation $z$,  and a decoder, that takes a latent representation $z$ and returns a reconstruction $\\hat{x}$. Inference is performed via variational inference to approximate the posterior of the model.",
  "title": "Auto-Encoding Variational Bayes",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Sarsa Lambda",
  "full_name": "Sarsa Lambda",
  "description": "**Sarsa_INLINE_MATH_1** extends eligibility-traces to action-value methods. It has the same update rule as for **TD_INLINE_MATH_1** but we use the action-value form of the TD erorr:\r\n\r\n$$ \\delta\\_{t} = R\\_{t+1} + \\gamma\\hat{q}\\left(S\\_{t+1}, A\\_{t+1}, \\mathbb{w}\\_{t}\\right) - \\hat{q}\\left(S\\_{t}, A\\_{t}, \\mathbb{w}\\_{t}\\right) $$\r\n\r\nand the action-value form of the [eligibility trace](https://paperswithcode.com/method/eligibility-trace):\r\n\r\n$$ \\mathbb{z}\\_{-1} = \\mathbb{0} $$\r\n\r\n$$ \\mathbb{z}\\_{t} = \\gamma\\lambda\\mathbb{z}\\_{t-1} + \\nabla\\hat{q}\\left(S\\_{t}, A\\_{t}, \\mathbb{w}\\_{t} \\right), 0 \\leq t \\leq T$$\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition",
  "title": null,
  "collection": "On-Policy TD Control",
  "area": "Reinforcement Learning"
}
{
  "name": "RelDiff",
  "full_name": "RelDiff",
  "description": "RelDiff generates entity-relation-entity embeddings in a single embedding space. RelDiff adopts two fundamental vector algebraic operators to transform entity and relation embeddings from knowledge graphs into entity-relation-entity embeddings. In particular, RelDiff can encode finer-grained information about the relations than is captured when separate embeddings are learned for the entities and the relations.",
  "title": "RelDiff: Enriching Knowledge Graph Relation Representations for Sensitivity Classification",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Adversarial Soft Advantage Fitting (ASAF)",
  "full_name": "Adversarial Soft Advantage Fitting (ASAF)",
  "description": "",
  "title": "Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "ConvMLP",
  "full_name": "ConvMLP",
  "description": "**ConvMLP** is a hierarchical convolutional MLP for visual recognition, which consists of a stage-wise, co-design of [convolution](https://paperswithcode.com/method/convolution) layers, and MLPs. The Conv Stage consists of $C$ convolutional blocks with $1\\times 1$ and $3\\times 3$ kernel sizes. It is repeated $M$ times before a down convolution is utilized to express a level $L$. The MLP-Conv Stage consists of Channelwise MLPs, with skip layers, and a [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution). This is repeated $M$ times before a down convolution is utilized to express a level $\\mathcal{L}$.",
  "title": "ConvMLP: Hierarchical Convolutional MLPs for Vision",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "NPID",
  "full_name": "NPID",
  "description": "**NPID** (Non-Parametric Instance Discrimination) is a self-supervision approach that takes a non-parametric classification approach. Noise contrastive estimation is used to learn representations. Specifically, distances (similarity) between instances are calculated directly from the features in a non-parametric way.",
  "title": "Unsupervised Feature Learning via Non-Parametric Instance Discrimination",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Deep LSTM Reader",
  "full_name": "Deep LSTM Reader",
  "description": "The **Deep LSTM Reader** is a neural network for reading comprehension. We feed documents one word at a time into a Deep [LSTM](https://paperswithcode.com/method/lstm) encoder; after a delimiter, we then also feed the query into the encoder. The model therefore processes each document-query pair as a single long sequence. Given the embedded document and query, the network predicts which token in the document answers the query.\r\n\r\nThe model consists of a Deep LSTM cell with skip connections from each input $x\\left(t\\right)$ to every hidden layer, and from every hidden layer to the output $y\\left(t\\right)$:\r\n\r\n$$x'\\left(t, k\\right) = x\\left(t\\right)||y'\\left(t, k - 1\\right) \\text{,  } y\\left(t\\right) = y'\\left(t, 1\\right)|| \\dots ||y'\\left(t, K\\right) $$\r\n\r\n$$ i\\left(t, k\\right) = \\sigma\\left(W\\_{kxi}x'\\left(t, k\\right) + W\\_{khi}h\\left(t - 1, k\\right) + W\\_{kci}c\\left(t - 1, k\\right) + b\\_{ki}\\right) $$\r\n\r\n$$ f\\left(t, k\\right) = \\sigma\\left(W\\_{kxf}x\\left(t\\right) + W\\_{khf}h\\left(t - 1, k\\right) + W\\_{kcf}c\\left(t - 1, k\\right) + b\\_{kf}\\right) $$\r\n\r\n$$ c\\left(t, k\\right) = f\\left(t, k\\right)c\\left(t - 1, k\\right) + i\\left(t, k\\right)\\text{tanh}\\left(W\\_{kxc}x'\\left(t, k\\right) + W\\_{khc}h\\left(t -  1, k\\right) + b\\_{kc}\\right) $$\r\n\r\n$$ o\\left(t, k\\right) = \\sigma\\left(W\\_{kxo}x'\\left(t, k\\right) + W\\_{kho}h\\left(t - 1, k\\right) + W\\_{kco}c\\left(t, k\\right) + b\\_{ko}\\right) $$\r\n\r\n$$ h\\left(t, k\\right) = o\\left(t, k\\right)\\text{tanh}\\left(c\\left(t, k\\right)\\right) $$\r\n\r\n$$ y'\\left(t, k\\right) = W\\_{ky}h\\left(t, k\\right) + b\\_{ky} $$\r\n\r\nwhere || indicates vector concatenation, $h\\left(t, k\\right)$ is the hidden state for layer $k$ at time $t$, and $i$, $f$, $o$ are the input, forget, and output gates respectively. Thus our Deep LSTM Reader is defined by $g^{\\text{LSTM}}\\left(d, q\\right) = y\\left(|d|+|q|\\right)$ with input $x\\left(t\\right)$ the concatenation of $d$ and $q$ separated by the delimiter |||.",
  "title": "Teaching Machines to Read and Comprehend",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "IMGEP",
  "full_name": "Intrinsically Motivated Goal Exploration Processes",
  "description": "Population-based intrinsically motivated goal exploration algorithms applied to real world robot learning of complex skills like tool use.",
  "title": "Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Eligibility Trace",
  "full_name": "Eligibility Trace",
  "description": "An **Eligibility Trace** is a memory vector $\\textbf{z}\\_{t} \\in \\mathbb{R}^{d}$ that parallels the long-term weight vector $\\textbf{w}\\_{t} \\in \\mathbb{R}^{d}$. The idea is that when a component of $\\textbf{w}\\_{t}$ participates in producing an estimated value, the corresponding component of $\\textbf{z}\\_{t}$ is bumped up and then begins to fade away. Learning will then occur in that component of $\\textbf{w}\\_{t}$ if a nonzero TD error occurs before the trace falls back to zero. The trace-decay parameter $\\lambda \\in \\left[0, 1\\right]$ determines the rate at which the trace falls.\r\n\r\nIntuitively, eligibility traces tackle the credit assignment problem by capturing both a frequency heuristic - states that are visited more often deserve more credit - and a recency heuristic - states that are visited more recently deserve more credit.\r\n\r\n$$E\\_{0}\\left(s\\right) = 0 $$\r\n$$E\\_{t}\\left(s\\right) = \\gamma\\lambda{E}\\_{t-1}\\left(s\\right) + \\textbf{1}\\left(S\\_{t} = s\\right) $$\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition",
  "title": null,
  "collection": "Eligibility Traces",
  "area": "Reinforcement Learning"
}
{
  "name": "CPC v2",
  "full_name": "CPC v2",
  "description": "**Contrastive Predictive Coding v2 (CPC v2)** is a self-supervised learning approach that builds upon the original [CPC](https://paperswithcode.com/method/contrastive-predictive-coding) with several improvements. These improvements include:\r\n\r\n- **Model capacity** - The third residual stack of [ResNet](https://paperswithcode.com/method/resnet)-101 (originally containing 23 blocks, 1024-dimensional feature maps, and 256-dimensional bottleneck layers), is converted to use 46 blocks, with 4096-dimensional feature maps and 512-dimensional bottleneck layers: ResNet-161.\r\n\r\n- **Layer Normalization** - The authors find CPC with [batch normalization](https://paperswithcode.com/method/batch-normalization) harms downstream performance. They hypothesize this is due to batch normalization allowing large models to find a trivial solution to CPC: it introduces a dependency between patches (through the batch statistics) that can be exploited to bypass the constraints on the receptive field. They replace batch normalization with [layer normalization](https://paperswithcode.com/method/layer-normalization).\r\n\r\n- **Predicting lengths and directions** - patches are predicted with contexts from both directions rather than just spatially underneath.\r\n\r\n- **Patch-based Augmentation** - Utilising \"color dropping\" which randomly drops two of the three color channels in each patch, as well as random horizontal flips.\r\n\r\n\r\nConsistent with prior results, this new architecture delivers better performance regardless of",
  "title": "Data-Efficient Image Recognition with Contrastive Predictive Coding",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "HaloNet",
  "full_name": "HaloNet",
  "description": "A **HaloNet** is a self-attention based model for efficient image classification. It relies on a local self-attention architecture that efficiently maps to existing hardware with haloing. The formulation breaks translational equivariance, but the authors observe that it improves  throughput and accuracies over the centered local self-attention used in regular self-attention. The approach also utilises a strided self-attentive downsampling operation for multi-scale feature extraction.",
  "title": "Scaling Local Self-Attention for Parameter Efficient Visual Backbones",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "CornerNet-Squeeze Hourglass",
  "full_name": "CornerNet-Squeeze Hourglass",
  "description": "**CornerNet-Squeeze Hourglass** is a convolutional neural network and object detection backbone used in the [CornerNet-Squeeze](https://paperswithcode.com/method/cornernet-squeeze) object detector. It uses a modified [hourglass module](https://paperswithcode.com/method/hourglass-module) that makes use of a [fire module](https://paperswithcode.com/method/fire-module): containing 1x1 convolutions and depthwise convolutions.",
  "title": "CornerNet-Lite: Efficient Keypoint Based Object Detection",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DenseNAS",
  "full_name": "DenseNAS",
  "description": "**DenseNAS** is a [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method that utilises a densely connected search space. The search space is represented as a dense super network, which is built upon designed routing blocks. In the super network, routing blocks are densely connected and we search for the best path between them to derive the final architecture. A chained cost estimation algorithm is used to approximate the model cost during the search.",
  "title": "Densely Connected Search Space for More Flexible Neural Architecture Search",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "CP N3",
  "full_name": "CP with N3 Regularizer",
  "description": "CP with N3 Regularizer",
  "title": "Canonical Tensor Decomposition for Knowledge Base Completion",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "H-BEMD",
  "full_name": "Hue — Bi-Dimensional Empirical Mode Decomposition",
  "description": "",
  "title": "Deep Learning for Landslide Recognition in Satellite Architecture",
  "collection": "Image Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "CLIP",
  "full_name": "Contrastive Language-Image Pre-training",
  "description": "**Contrastive Language-Image Pre-training** (**CLIP**), consisting of a simplified version of ConVIRT trained from scratch, is an efficient method of image representation learning from natural language supervision. CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes. \r\n\r\nFor pre-training, CLIP is trained to predict which of the $N \\times N$ possible (image, text) pairings across a batch actually occurred. CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. A symmetric cross entropy loss is optimized over these similarity scores. \r\n\r\nImage credit: [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020.pdf)",
  "title": "Learning Transferable Visual Models From Natural Language Supervision",
  "collection": "Image Representations",
  "area": "Computer Vision"
}
{
  "name": "BiGCN",
  "full_name": "Bi-Directional Graph Convolutional Network",
  "description": "",
  "title": "Rumor Detection on Social Media with Bi-Directional Graph Convolutional Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "MAS",
  "full_name": "Mixing Adam and SGD",
  "description": "This optimizer mixes [ADAM](https://paperswithcode.com/method/adam) and [SGD](https://paperswithcode.com/method/sgd) to create the MAS optimizer.",
  "title": "Mixing ADAM and SGD: a Combined Optimization Method",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Adaptive Softmax",
  "full_name": "Adaptive Softmax",
  "description": "**Adaptive Softmax** is a speedup technique for the computation of probability distributions over words. The adaptive [softmax](https://paperswithcode.com/method/softmax) is inspired by the class-based [hierarchical softmax](https://paperswithcode.com/method/hierarchical-softmax), where the word classes are built to minimize the computation time. Adaptive softmax achieves efficiency by explicitly taking into account the computation time of matrix-multiplication on parallel systems and combining it with a few important observations, namely keeping a shortlist of frequent words in the root node and reducing the capacity of rare words.",
  "title": "Efficient softmax approximation for GPUs",
  "collection": "Output Functions",
  "area": "General"
}
{
  "name": "GraphSAINT",
  "full_name": "Graph sampling based inductive learning method",
  "description": "Scalable method to train large scale GNN models via sampling small subgraphs.",
  "title": "GraphSAINT: Graph Sampling Based Inductive Learning Method",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "ScanSSD",
  "full_name": "ScanSSD",
  "description": "**ScanSSD** is a Single Shot Detector ([SSD](https://paperswithcode.com/method/ssd)) for locating math formulas offset from text and embedded in textlines. It uses only visual features for detection: no formatting or typesetting information such as layout, font, or character labels is employed. Given a 600 dpi document page image, the SSD locates formulas at multiple scales using sliding windows, after which candidate detections are pooled to obtain page-level results.",
  "title": "ScanSSD: Scanning Single Shot Detector for Mathematical Formulas in PDF Document Images",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "SENet",
  "full_name": "SENet",
  "description": "A **SENet** is a convolutional neural network architecture that employs squeeze-and-excitation blocks to enable the network to perform dynamic channel-wise feature recalibration.",
  "title": "Squeeze-and-Excitation Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Adaptive Instance Normalization",
  "full_name": "Adaptive Instance Normalization",
  "description": "**Adaptive Instance Normalization** is a normalization method that aligns the mean and variance of the content features with those of the style features. \r\n\r\n[Instance Normalization](https://paperswithcode.com/method/instance-normalization) normalizes the input to a single style specified by the affine parameters. Adaptive Instance Normalization (AdaIN) is an extension. In AdaIN, we receive a content input $x$ and a style input $y$, and we simply align the channel-wise mean and variance of $x$ to match those of $y$. Unlike [Batch Normalization](https://paperswithcode.com/method/batch-normalization), Instance Normalization or [Conditional Instance Normalization](https://paperswithcode.com/method/conditional-instance-normalization), AdaIN has no learnable affine parameters. Instead, it adaptively computes the affine parameters from the style input:\r\n\r\n$$\r\n\\textrm{AdaIN}(x, y)= \\sigma(y)\\left(\\frac{x-\\mu(x)}{\\sigma(x)}\\right)+\\mu(y)\r\n$$",
  "title": "Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "MLP-Mixer",
  "full_name": "MLP-Mixer",
  "description": "The **MLP-Mixer** architecture (or “Mixer” for short) is an image architecture that doesn't use convolutions or self-attention. Instead, Mixer’s architecture is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels. Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.\r\n\r\nIt accepts a sequence of linearly projected image patches (also referred to as tokens) shaped as a “patches × channels” table as an input, and maintains this dimensionality. Mixer makes use of two types of MLP layers: channel-mixing MLPs and token-mixing MLPs. The channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs. The token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs. These two types of layers are interleaved to enable interaction of both input dimensions.",
  "title": "MLP-Mixer: An all-MLP Architecture for Vision",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "RAN",
  "full_name": "Residual Attention Network",
  "description": "Inspired by the success of ResNet,\r\nWang et al. proposed\r\nthe very deep convolutional residual attention network (RAN) by \r\ncombining an attention mechanism with residual connections. \r\n\r\nEach attention module stacked in a residual attention network \r\ncan be divided into a mask branch and a trunk branch. \r\nThe trunk branch processes features,\r\nand can be implemented by any state-of-the-art structure\r\nincluding a pre-activation residual unit and an inception block.\r\nThe mask branch uses a bottom-up top-down structure\r\nto learn a mask of the same size that \r\nsoftly weights output features from the trunk branch. \r\nA sigmoid layer normalizes the output to $[0,1]$ after two $1\\times 1$ convolution layers. Overall the residual attention mechanism can be written as\r\n\r\n\\begin{align}\r\ns &= \\sigma(Conv_{2}^{1\\times 1}(Conv_{1}^{1\\times 1}( h_\\text{up}(h_\\text{down}(X))))) \r\n\\end{align}\r\n\r\n\\begin{align}\r\nX_{out} &= s f(X) + f(X)\r\n\\end{align}\r\nwhere $h_\\text{up}$ is a bottom-up structure, \r\nusing max-pooling several times after residual units\r\nto increase the receptive field, while\r\n$h_\\text{down}$ is the top-down part using \r\nlinear interpolation to keep the output size the \r\nsame as the input feature map. \r\nThere are also skip-connections between the two parts,\r\nwhich are omitted from the formulation.\r\n$f$ represents the trunk branch\r\nwhich can be any state-of-the-art structure.\r\n\r\nInside each attention module, a\r\nbottom-up top-down feedforward structure models\r\nboth spatial and cross-channel dependencies, \r\n leading to a consistent performance improvement. \r\nResidual attention can be incorporated into\r\nany deep network structure in an end-to-end training fashion.\r\nHowever, the proposed bottom-up top-down structure fails to leverage global spatial information.  \r\nFurthermore, directly predicting a 3D attention map  has high computational cost.",
  "title": "Residual Attention Network for Image Classification",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Fast_BAT",
  "full_name": "Fast Bi-level Adversarial Training",
  "description": "Fast-BAT is a new method for accelerated adversarial training.",
  "title": "Revisiting and Advancing Fast Adversarial Training Through The Lens of Bi-Level Optimization",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "Logistic Regression",
  "full_name": "Logistic Regression",
  "description": "**Logistic Regression**, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.\r\n\r\nSource: [scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)\r\n\r\nImage: [Michaelg2015](https://commons.wikimedia.org/wiki/User:Michaelg2015)",
  "title": null,
  "collection": "Generalized Linear Models",
  "area": "General"
}
{
  "name": "Flow Normalization",
  "full_name": "Flow Normalization",
  "description": "",
  "title": "What's in the Flow? Exploiting Temporal Motion Cues for Unsupervised Generic Event Boundary Detection",
  "collection": "Heuristic Search Algorithms",
  "area": "Reinforcement Learning"
}
{
  "name": "Rotary Embeddings",
  "full_name": "Rotary Position Embedding",
  "description": "**Rotary Position Embedding**, or **RoPE**, is a type of position embedding which encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties such as the flexibility of being expanded to any sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping linear self-attention with relative position encoding.",
  "title": "RoFormer: Enhanced Transformer with Rotary Position Embedding",
  "collection": "Position Embeddings",
  "area": "General"
}
{
  "name": "Fixup Initialization",
  "full_name": "Fixup Initialization",
  "description": "**Fixup Initialization**, or **Fixed-Update Initialization**, is an initialization method that rescales the standard initialization of [residual branches](https://paperswithcode.com/method/residual-block) by adjusting for the network architecture. Fixup aims to enable training very deep [residual networks](https://paperswithcode.com/method/resnet) stably at a maximal learning rate without [normalization](https://paperswithcode.com/methods/category/normalization).\r\n\r\nThe steps are as follows:\r\n\r\n1. Initialize the classification layer and the last layer of each residual branch to 0.\r\n\r\n2. Initialize every other layer using a standard method, e.g. [Kaiming Initialization](https://paperswithcode.com/method/he-initialization), and scale only the weight layers inside residual branches by $L^{-\\frac{1}{2m-2}}$, where $L$ is the number of residual branches and $m$ is the number of layers inside each branch.\r\n\r\n3. Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each [convolution](https://paperswithcode.com/method/convolution), [linear](https://paperswithcode.com/method/linear-layer), and element-wise activation layer.",
  "title": "Fixup Initialization: Residual Learning Without Normalization",
  "collection": "Initialization",
  "area": "General"
}
{
  "name": "TrIVD-GAN",
  "full_name": "TrIVD-GAN",
  "description": "**TrIVD-GAN**, or **Transformation-based & TrIple Video Discriminator GAN**, is a type of generative adversarial network for video generation that builds upon [DVD-GAN](https://paperswithcode.com/method/dvd-gan). Improvements include a novel transformation-based recurrent unit (the TSRU) that makes the generator more expressive, and an improved discriminator architecture. \r\n\r\nIn contrast with DVD-[GAN](https://paperswithcode.com/method/gan), TrIVD-GAN has an alternative split for the roles of the discriminators, with $\\mathcal{D}\\_{S}$ judging per-frame global structure, while $\\mathcal{D}\\_{T}$ critiques local spatiotemporal structure. This is achieved by downsampling the $k$ randomly sampled frames fed to $\\mathcal{D}\\_{S}$ by a factor $s$, and cropping $T \\times H/s \\times W/s$ clips inside the high resolution video fed to $\\mathcal{D}\\_{T}$, where $T, H, W, C$ correspond to the time, height, width and channel dimensions of the input. This further reduces the number of pixels to process per video, from $k \\times H \\times W + T \\times H/s \\times W/s$ to $\\left(k + T\\right) \\times H/s \\times W/s$.",
  "title": "Transformation-based Adversarial Video Prediction on Large-Scale Data",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "HINT",
  "full_name": "Hierarchical Information Threading",
  "description": "An unsupervised approach for identifying Hierarchical Information Threads by analysing the network of related articles in a collection. In particular, HINT leverages article timestamps and the 5W1H questions to identify related articles about an event or discussion. HINT then constructs a network representation of the articles and identifies threads as strongly connected hierarchical network communities.",
  "title": "Effective Hierarchical Information Threading Using Network Community Detection",
  "collection": null,
  "area": null
}
{
  "name": "PSANet",
  "full_name": "PSANet",
  "description": "**PSANet** is a semantic segmentation architecture that utilizes a [Point-wise Spatial Attention](https://paperswithcode.com/method/point-wise-spatial-attention) (PSA) module to aggregate long-range contextual information in a flexible and adaptive manner. Each position in the feature map is connected with all other ones through self-adaptively predicted attention maps, thus harvesting various information nearby and far away. Furthermore, the authors design the bi-directional information propagation path for a comprehensive understanding of complex scenes. Each position collects information from all others to help the prediction of itself and vice versa, the information at each position can be distributed globally, assisting the prediction of all other positions. Finally, the bi-directionally aggregated contextual information is fused with local features to form the final representation of complex scenes.\r\n\r\nThe authors use [ResNet](https://paperswithcode.com/method/resnet) as an [FCN](https://paperswithcode.com/method/fcn) backbone for PSANet, as the Figure to the right illustrates. The proposed PSA module is then used to aggregate long-range contextual information from the local representation. It follows stage-5 in ResNet, which is the final stage of the FCN backbone. Features in stage-5 are semantically stronger. Aggregating them together leads to a more comprehensive representation of long-range context. Moreover, the spatial size of the feature map at stage-5 is smaller and can reduce computation overhead and memory consumption. An auxiliary loss branch is applied apart from the main loss.",
  "title": "PSANet: Point-wise Spatial Attention Network for Scene Parsing",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "SkipInit",
  "full_name": "SkipInit",
  "description": "**SkipInit** is a method that aims to allow [normalization](https://paperswithcode.com/methods/category/normalization)-free training of neural networks by downscaling [residual branches](https://paperswithcode.com/method/residual-block) at initialization.  This is achieved by including a learnable scalar multiplier at the end of each residual branch, initialized to $\\alpha$.\r\n\r\nThe method is motivated by theoretical findings that [batch normalization](https://paperswithcode.com/method/batch-normalization) downscales the hidden activations on the residual branch by a factor on the order of the square root of the network depth (at initialization). Therefore, as the depth of a residual network is increased, the residual blocks are increasingly dominated by the [skip connection](https://paperswithcode.com/method/residual-connection), which drives the functions computed by residual blocks closer to the identity, preserving signal propagation and ensuring well-behaved gradients. This leads to the proposed method which can achieve this property through an [initialization](https://paperswithcode.com/methods/category/initialization) strategy rather than a [normalization](https://paperswithcode.com/methods/category/normalization) strategy.",
  "title": "Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks",
  "collection": "Initialization",
  "area": "General"
}
{
  "name": "VLG-Net",
  "full_name": "Video Language Graph Matching Network",
  "description": "VLG-Net leverages recent advances in Graph Neural Networks (GNNs) and uses a novel multi-modality graph-based fusion method for the task of natural language video grounding.",
  "title": "VLG-Net: Video-Language Graph Matching Network for Video Grounding",
  "collection": "Video-Text Retrieval Models",
  "area": "Computer Vision"
}
{
  "name": "Depth-wise Plane Sweeping",
  "full_name": "Depth-wise Plane Sweeping",
  "description": "",
  "title": "DSGN++: Exploiting Visual-Spatial Relation for Stereo-based 3D Detectors",
  "collection": "3D Reconstruction",
  "area": "Computer Vision"
}
{
  "name": "Adversarial Color Enhancement",
  "full_name": "Adversarial Color Enhancement",
  "description": "**Adversarial Color Enhancement** is an approach to generating unrestricted adversarial images by optimizing a color filter via gradient descent.",
  "title": "Adversarial Color Enhancement: Generating Unrestricted Adversarial Images by Optimizing a Color Filter",
  "collection": "Adversarial Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "GroupDNet",
  "full_name": "Group Decreasing Network",
  "description": "**Group Decreasing Network**, or **GroupDNet**, is a type of convolutional neural network for multi-modal image synthesis. GroupDNet contains one encoder and one decoder. Inspired by the idea of [VAE](https://paperswithcode.com/method/vae) and SPADE, the encoder $E$ produces a latent code $Z$ that is supposed to follow a Gaussian distribution $\\mathcal{N}(0,1)$ during training. While testing, the encoder $E$ is discarded. A randomly sampled code from the Gaussian distribution substitutes for $Z$. To fulfill this, the re-parameterization trick is used to enable a differentiable loss function during training. Specifically, the encoder predicts a mean vector and a variance vector through two fully connected layers to represent the encoded distribution. The gap between the encoded distribution and Gaussian distribution can be minimized by imposing a KL-divergence loss.",
  "title": "Semantically Multi-modal Image Synthesis",
  "collection": "Image Generation Models",
  "area": "Computer Vision"
}
{
  "name": "ADMM",
  "full_name": "Alternating Direction Method of Multipliers",
  "description": "The **alternating direction method of multipliers** (**ADMM**) is an algorithm that solves convex optimization problems by breaking them into smaller pieces, each of which is then easier to handle. It takes the form of a decomposition-coordination procedure, in which the solutions to small local subproblems are coordinated to find a solution to a large global problem. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization. It turns out to be equivalent or closely related to many other algorithms as well, such as Douglas-Rachford splitting from numerical analysis, Spingarn’s method of partial inverses, Dykstra’s alternating projections method, Bregman iterative algorithms for l1 problems in signal processing, proximal methods, and many others.\r\n\r\nText Source: [https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf](https://stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf)\r\n\r\nImage Source: [here](https://www.slideshare.net/derekcypang/alternating-direction)",
  "title": null,
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "Galactica",
  "full_name": "Galactica",
  "description": "Galactica is a language model which uses a Transformer architecture in a decoder-only setup with the following modifications:\r\n\r\n- It uses GeLU activations on all model sizes\r\n- It uses a 2048 length context window for all model sizes\r\n- It does not use biases in any of the dense kernels or layer norms\r\n- It uses learned positional embeddings for the model\r\n- A vocabulary of 50k tokens was constructed using BPE. The vocabulary was generated from a randomly selected 2% subset of the training data",
  "title": "Galactica: A Large Language Model for Science",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Two-Way Dense Layer",
  "full_name": "Two-Way Dense Layer",
  "description": "**Two-Way Dense Layer** is an image model block used in the [PeleeNet](https://paperswithcode.com/method/peleenet) architectures. Motivated by [GoogLeNet](https://paperswithcode.com/method/googlenet), the 2-way dense layer is used to get different scales of receptive fields. One branch of the layer uses a 3x3 kernel size. The other branch uses two stacked 3x3 [convolutions](https://paperswithcode.com/method/convolution) to learn visual patterns for large objects.",
  "title": "Pelee: A Real-Time Object Detection System on Mobile Devices",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "Ape-X DQN",
  "full_name": "Ape-X DQN",
  "description": "**Ape-X DQN** is a variant of a [DQN](https://paperswithcode.com/method/dqn) with some components of [Rainbow-DQN](https://paperswithcode.com/method/rainbow-dqn) that utilizes distributed [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) through the [Ape-X](https://paperswithcode.com/method/ape-x) architecture.",
  "title": "Distributed Prioritized Experience Replay",
  "collection": "Q-Learning Networks",
  "area": "Reinforcement Learning"
}
{
  "name": "ReGaDa",
  "full_name": "Residual gating mechanism to compose adverb-action representations",
  "description": "",
  "title": "Video-adverb retrieval with compositional adverb-action embeddings",
  "collection": "Video-Text Retrieval Models",
  "area": "Computer Vision"
}
{
  "name": "ODL",
  "full_name": "online deep learning",
  "description": "Deep Neural Networks (DNNs) are typically trained by backpropagation in a batch learning setting, which requires the entire training data to be made available prior to the learning task. This is not scalable for many real-world scenarios where new data arrives sequentially in a stream form. We aim to address an open challenge of \"Online Deep Learning\" (ODL) for learning DNNs on the fly in an online setting. Unlike traditional online learning that often optimizes some convex objective function with respect to a shallow model (e.g., a linear/kernel-based hypothesis), ODL is significantly more challenging since the optimization of the DNN objective function is non-convex, and regular backpropagation does not work well in practice, especially for online learning settings.",
  "title": "Online Deep Learning: Learning Deep Neural Networks on the Fly",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "GLIDE",
  "full_name": "Guided Language to Image Diffusion for Generation and Editing",
  "description": "GLIDE is a generative model based on text-guided diffusion models for more photorealistic image generation. Guided diffusion is applied to text-conditional image synthesis and the model is able to handle free-form prompts. The diffusion model uses a text encoder to condition on natural language descriptions. The model is provided with editing capabilities in addition to zero-shot generation, allowing for iterative improvement of model samples to match more complex prompts. The model is fine-tuned to perform image inpainting.",
  "title": "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models",
  "collection": "Multi-Modal Methods",
  "area": "Computer Vision"
}
{
  "name": "Dilated Bottleneck Block",
  "full_name": "Dilated Bottleneck Block",
  "description": "**Dilated Bottleneck Block** is an image model block used in the [DetNet](https://paperswithcode.com/method/detnet) convolutional neural network architecture. It employs a bottleneck structure with dilated convolutions to efficiently enlarge the receptive field.",
  "title": "DetNet: A Backbone network for Object Detection",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "BiFPN",
  "full_name": "BiFPN",
  "description": "A **BiFPN**, or **Weighted Bi-directional Feature Pyramid Network**, is a type of feature pyramid network which allows easy and fast multi-scale feature fusion. It incorporates the multi-level feature fusion idea from [FPN](https://paperswithcode.com/method/fpn), [PANet](https://paperswithcode.com/method/panet) and [NAS-FPN](https://paperswithcode.com/method/nas-fpn) that enables information to flow in both the top-down and bottom-up directions, while using regular and efficient connections. It also utilizes a fast normalized fusion technique. Traditional approaches usually treat all features input to the FPN equally, even those with different resolutions. However, input features at different resolutions often have unequal contributions to the output features. Thus, the BiFPN adds an additional weight for each input feature allowing the network to learn the importance of each. All regular convolutions are also replaced with less expensive depthwise separable convolutions.\r\n\r\nComparing with PANet, PANet added an extra bottom-up path for information flow at the expense of more computational cost. Whereas BiFPN optimizes these cross-scale connections by removing nodes with a single input edge, adding an extra edge from the original input to output node if they are on the same level, and treating each bidirectional path as one feature network layer (repeating it several times for more high-level future fusion).",
  "title": "EfficientDet: Scalable and Efficient Object Detection",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Masked Convolution",
  "full_name": "Masked Convolution",
  "description": "A **Masked Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) which masks certain pixels so that the model can only predict based on pixels already seen. This type of convolution was introduced with [PixelRNN](https://paperswithcode.com/method/pixelrnn) generative models, where an image is generated pixel by pixel, to ensure that the model was conditional only on pixels already visited.",
  "title": "Pixel Recurrent Neural Networks",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "AutoParsimony",
  "full_name": "Automatic Search for Parsimonious Models",
  "description": "The principle of parsimony, also known as Occam's razor, elucidates the preference for the simplest explanation that provides optimal results when faced with multiple options. Thus, we can assert that the principle of parsimony is justified by \"the assumption that is both the simplest and contains all the necessary information required to comprehend the experiment at hand.\" This principle finds application in various scenarios or events in our daily lives, including predictions in Data Science models.\r\n\r\nIt is widely recognized that a less complex model will produce more stable predictions, exhibit greater resilience to noise and disturbances, and be more manageable for maintenance and analysis. Additionally, reducing the number of features can lead to further cost savings by diminishing the use of sensors, lowering energy consumption, minimizing information acquisition costs, reducing maintenance requirements, and mitigating the necessity to retrain models due to feature fluctuations caused by noise, outliers, data drift, etc.\r\n\r\nThe concurrent optimization of hyperparameters (HO) and feature selection (FS) for achieving Parsimonious Model Selection (PMS) is an ongoing area of active research. Nonetheless, the effective selection of appropriate hyperparameters and feature subsets presents a challenging combinatorial problem, frequently requiring the application of efficient heuristic methods.",
  "title": null,
  "collection": "AutoML",
  "area": "General"
}
{
  "name": "SHAP",
  "full_name": "Shapley Additive Explanations",
  "description": "**SHAP**, or **SHapley Additive exPlanations**, is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. Shapley values are approximating using Kernel SHAP, which uses a weighting kernel for the approximation, and DeepSHAP, which uses DeepLift to approximate them.",
  "title": "A Unified Approach to Interpreting Model Predictions",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "Self-adaptive Training",
  "full_name": "Self-adaptive Training",
  "description": "**Self-adaptive Training** is a training algorithm that dynamically corrects problematic training labels by model predictions to improve generalization of deep learning for potentially corrupted training data. Accumulated predictions are used to augment the training dynamics. The use of an exponential-moving-average scheme alleviates the instability issue of model predictions, smooths out the training target during the training process and enables the algorithm to completely change the training labels if necessary.",
  "title": "Self-Adaptive Training: beyond Empirical Risk Minimization",
  "collection": "Robust Training",
  "area": "General"
}
{
  "name": "Position-Sensitive RoI Pooling",
  "full_name": "Position-Sensitive RoI Pooling",
  "description": "**Position-Sensitive RoI Pooling layer** aggregates the outputs of the last convolutional layer and generates scores for each RoI. Unlike [RoI Pooling](https://paperswithcode.com/method/roi-pooling), PS RoI Pooling conducts selective pooling, and each of the $k$ × $k$ bin aggregates responses from only one score map out of the bank of $k$ × $k$ score maps. With end-to-end training, this RoI layer shepherds the last convolutional layer to learn specialized position-sensitive score maps.",
  "title": "R-FCN: Object Detection via Region-based Fully Convolutional Networks",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Vision Transformer",
  "full_name": "Vision Transformer",
  "description": "The **Vision Transformer**, or **ViT**, is a model for image classification that employs a [Transformer](https://paperswithcode.com/method/transformer)-like architecture over patches of the image.  An image is split into fixed-size patches, each of them are then linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard [Transformer](https://paperswithcode.com/method/transformer) encoder. In order to perform classification, the standard approach of adding an extra learnable “classification token” to the sequence is used.",
  "title": "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "BAGUA",
  "full_name": "BAGUA",
  "description": "**BAGUA** is a communication framework whose design goal is to provide a system abstraction that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training. The abstraction goes beyond parameter server and Allreduce paradigms, and provides a collection of MPI-style collective operations to facilitate communications with different precision and centralization strategies.",
  "title": "BAGUA: Scaling up Distributed Learning with System Relaxations",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "DGCNN",
  "full_name": "Deep Graph Convolutional Neural Network",
  "description": "DGCNN involves neural networks that read the graphs directly and learn a classification function. There are two main challenges: 1) how to extract useful features characterizing the rich information encoded in a graph for classification purpose, and 2) how to sequentially read a graph in a meaningful and consistent order. To address the first challenge, we design a localized graph convolution model and show its connection with two graph kernels. To address the second challenge, we design a novel SortPooling layer which sorts graph vertices in a consistent order so that traditional neural networks can be trained on the graphs.\r\n\r\nDescription and image from: [An End-to-End Deep Learning Architecture for Graph Classification](https://muhanzhang.github.io/papers/AAAI_2018_DGCNN.pdf)",
  "title": "An End-to-End Deep Learning Architecture for Graph Classification",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Phish",
  "full_name": "Phish: A Novel Hyper-Optimizable Activation Function",
  "description": "Deep-learning models estimate values using backpropagation. The activation function within hidden layers is a critical component to minimizing loss in deep neural-networks. Rectified Linear (ReLU) has been the dominant activation function for the past decade. Swish and Mish are newer activation functions that have shown to yield better results than ReLU given specific circumstances. Phish is a novel activation function proposed here. It is a composite function defined as f(x) = xTanH(GELU(x)), where no discontinuities are apparent in the differentiated graph on the domain observed. Generalized networks were constructed using different activation functions. SoftMax was the output function. Using images from MNIST and CIFAR-10 databanks, these networks were trained to minimize sparse categorical crossentropy. A large scale cross-validation was simulated using stochastic Markov chains to account for the law of large numbers for the probability values. Statistical tests support the research hypothesis stating Phish could outperform other activation functions in classification. Future experiments would involve testing Phish in unsupervised learning algorithms and comparing it to more activation functions.",
  "title": "Phish: A Novel Hyper-Optimizable Activation Function",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Local Importance-based Pooling",
  "full_name": "Local Importance-based Pooling",
  "description": "**Local Importance-based Pooling (LIP)** is a pooling layer that can enhance discriminative features during the downsampling procedure by learning adaptive importance weights based on inputs. By using a learnable network $G$ in $F$, the importance function now is not limited in hand-crafted forms and able to learn the criterion for the discriminativeness of features. Also, the window size of LIP is restricted to be not less than stride to fully utilize the feature map and avoid the issue of fixed interval sampling scheme. More specifically, the importance function in LIP is implemented by a tiny fully convolutional network, which learns to produce the importance map based on inputs in an end-to-end manner.",
  "title": "LIP: Local Importance-based Pooling",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "MUSIQ",
  "full_name": "MUSIQ",
  "description": "**MUSIQ**, or **Multi-scale Image Quality Transformer**, is a [Transformer](https://paperswithcode.com/method/transformer)-based model for multi-scale image quality assessment. It processes native resolution images with varying sizes and aspect ratios. In MUSIQ, we construct a multi-scale image representation as input, including the native resolution image and its ARP resized variants.  Each image is split into fixed-size patches which are embedded by a patch encoding module (blue boxes). To capture 2D structure of the image and handle images of varying aspect ratios, the spatial embedding is encoded by hashing the patch position $(i,j)$ to $(t_{i},t_{j})$ within a grid of learnable embeddings (red boxes). Scale Embedding (green boxes) is introduced to capture scale information. The Transformer encoder takes the input tokens and performs multi-head self-attention. To predict the image quality, MUSIQ follows a common strategy in Transformers to add an [CLS] token to the sequence to represent the whole multi-scale input and the corresponding Transformer output is used as the final representation.",
  "title": "MUSIQ: Multi-scale Image Quality Transformer",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Darknet-19",
  "full_name": "Darknet-19",
  "description": "**Darknet-19** is a convolutional neural network that is used as the backbone of [YOLOv2](https://paperswithcode.com/method/yolov2).  Similar to the [VGG](https://paperswithcode.com/method/vgg) models it mostly uses $3 \\times 3$ filters and doubles the number of channels after every pooling step. Following the work on Network in Network (NIN) it uses [global average pooling](https://paperswithcode.com/method/global-average-pooling) to make predictions as well as $1 \\times 1$ filters to compress the feature representation between $3 \\times 3$ convolutions. [Batch Normalization](https://paperswithcode.com/method/batch-normalization) is used to stabilize training, speed up convergence, and regularize the model batch.",
  "title": "YOLO9000: Better, Faster, Stronger",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Fraternal Dropout",
  "full_name": "Fraternal Dropout",
  "description": "**Fraternal Dropout** is a regularization method for recurrent neural networks that trains two identical copies of an RNN (that share parameters) with different [dropout](https://paperswithcode.com/method/dropout) masks while minimizing the difference between their (pre-[softmax](https://paperswithcode.com/method/softmax)) predictions. This encourages the representations of RNNs to be invariant to dropout mask, thus being robust.",
  "title": "Fraternal Dropout",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "O-Net",
  "full_name": "O-Net",
  "description": "",
  "title": "COVIR: A virtual rendering of a novel NN architecture O-Net for COVID-19 Ct-scan automatic lung lesions segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "R-Mix",
  "full_name": "Random Mix-up",
  "description": "R-Mix (Random Mix-up) is a Mix-up family Data Augmentation method. It combines random Mix-up with Saliency-guided mix-up, producing a procedure that is fast and performant, while reserving good characteristics of Saliency-guided Mix-up such as low Expected Calibration Error and high Weakly-supervised Object Localization accuracy.",
  "title": "Expeditious Saliency-guided Mix-up through Random Gradient Thresholding",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "NT-ASGD",
  "full_name": "Non-monotonically Triggered ASGD",
  "description": "**NT-ASGD**, or **Non-monotonically Triggered ASGD**, is an averaged stochastic gradient descent technique. \r\n\r\nIn regular ASGD, we take steps identical to [regular SGD](https://paperswithcode.com/method/sgd) but instead of returning the last iterate as the solution, we return $\\frac{1}{\\left(K-T+1\\right)}\\sum^{T}\\_{i=T}w\\_{i}$, where $K$ is the total number of iterations and $T < K$ is a user-specified averaging trigger.\r\n\r\nNT-ASGD has a non-monotonic criterion that conservatively triggers the averaging when the validation metric fails to improve for multiple cycles. Given that the choice of triggering is irreversible, this conservatism ensures that the randomness of training does not play a major role in the decision.",
  "title": "Regularizing and Optimizing LSTM Language Models",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Guided Anchoring",
  "full_name": "Guided Anchoring",
  "description": "**Guided Anchoring** is an anchoring scheme for object detection which leverages semantic features to guide the anchoring. The method is motivated by the observation that objects are not distributed evenly over the image. The scale of an object is also closely related to the imagery content, its location and geometry of the scene. Following this intuition, the method generates sparse anchors in two steps: first identifying sub-regions that may contain objects and then determining the shapes at different locations.",
  "title": "Region Proposal by Guided Anchoring",
  "collection": "Anchor Generation Modules",
  "area": "Computer Vision"
}
{
  "name": "Xception",
  "full_name": "Xception",
  "description": "**Xception** is a convolutional neural network architecture that relies solely on [depthwise separable convolution](https://paperswithcode.com/method/depthwise-separable-convolution) layers.",
  "title": "Xception: Deep Learning With Depthwise Separable Convolutions",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Spatial Attention Module",
  "full_name": "Spatial Attention Module",
  "description": "A **Spatial Attention Module** is a module for spatial attention in convolutional neural networks. It generates a spatial attention map by utilizing the inter-spatial relationship of features. Different from the [channel attention](https://paperswithcode.com/method/channel-attention-module), the spatial attention focuses on where is an informative part, which is complementary to the channel attention. To compute the spatial attention, we first apply average-pooling and max-pooling operations along the channel axis and concatenate them to generate an efficient feature descriptor. On the concatenated feature descriptor, we apply a [convolution](https://paperswithcode.com/method/convolution) layer to generate a spatial attention map $\\textbf{M}\\_{s}\\left(F\\right) \\in \\mathcal{R}^{H×W}$ which encodes where to emphasize or suppress. \r\n\r\nWe aggregate channel information of a feature map by using two pooling operations, generating two 2D maps: $\\mathbf{F}^{s}\\_{avg} \\in \\mathbb{R}^{1\\times{H}\\times{W}}$ and $\\mathbf{F}^{s}\\_{max} \\in \\mathbb{R}^{1\\times{H}\\times{W}}$. Each denotes average-pooled features and max-pooled features across the channel. Those are then concatenated and convolved by a standard convolution layer, producing the 2D spatial attention map. In short, the spatial attention is computed as:\r\n\r\n$$ \\textbf{M}\\_{s}\\left(F\\right) = \\sigma\\left(f^{7x7}\\left(\\left[\\text{AvgPool}\\left(F\\right);\\text{MaxPool}\\left(F\\right)\\right]\\right)\\right) $$\r\n\r\n$$ \\textbf{M}\\_{s}\\left(F\\right) = \\sigma\\left(f^{7x7}\\left(\\left[\\mathbf{F}^{s}\\_{avg};\\mathbf{F}^{s}\\_{max} \\right]\\right)\\right) $$\r\n\r\nwhere $\\sigma$ denotes the sigmoid function and $f^{7×7}$ represents a convolution operation with the filter size of 7 × 7.",
  "title": "CBAM: Convolutional Block Attention Module",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "CW-ERM",
  "full_name": "Closed-loop Weighted Empirical Risk Minimization",
  "description": "A closed-loop evaluation procedure is first used in a simulator to identify training data samples that are important for practical driving performance and then we these samples to help debias the policy network.",
  "title": "CW-ERM: Improving Autonomous Driving Planning with Closed-loop Weighted Empirical Risk Minimization",
  "collection": "Robust Training",
  "area": "General"
}
{
  "name": "EBM",
  "full_name": "energy-based model",
  "description": "",
  "title": "A Theory of Generative ConvNet",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "ELR",
  "full_name": "Early Learning Regularization",
  "description": "",
  "title": "Early-Learning Regularization Prevents Memorization of Noisy Labels",
  "collection": "Robust Training",
  "area": "General"
}
{
  "name": "Fast-OCR",
  "full_name": "Fast-OCR",
  "description": "Fast-OCR is a new lightweight detection network that incorporates features from existing models focused on the speed/accuracy trade-off, such as [YOLOv2](https://paperswithcode.com/method/yolov2), [CR-NET](https://paperswithcode.com/method/cr-net), and Fast-[YOLOv4](https://paperswithcode.com/method/yolov4).",
  "title": "Towards Image-based Automatic Meter Reading in Unconstrained Scenarios: A Robust and Efficient Approach",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "NoisyNet-DQN",
  "full_name": "NoisyNet-DQN",
  "description": "**NoisyNet-DQN** is a modification of a [DQN](https://paperswithcode.com/method/dqn) that utilises noisy linear layers for exploration instead of $\\epsilon$-greedy exploration as in the original DQN formulation.",
  "title": "Noisy Networks for Exploration",
  "collection": "Q-Learning Networks",
  "area": "Reinforcement Learning"
}
{
  "name": "FPN",
  "full_name": "Feature Pyramid Network",
  "description": "A **Feature Pyramid Network**, or **FPN**, is a feature extractor that takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architectures. It therefore acts as a generic solution for building feature pyramids inside deep convolutional networks to be used in tasks like object detection.\r\n\r\nThe construction of the pyramid involves a bottom-up pathway and a top-down pathway.\r\n\r\nThe bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. For the feature\r\npyramid, one pyramid level is defined for each stage. The output of the last layer of each stage is used as a reference set of feature maps. For [ResNets](https://paperswithcode.com/method/resnet) we use the feature activations output by each stage’s last [residual block](https://paperswithcode.com/method/residual-block). \r\n\r\nThe top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.",
  "title": "Feature Pyramid Networks for Object Detection",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "AutoEncoder",
  "full_name": "AutoEncoder",
  "description": "An **Autoencoder** is a bottleneck architecture that turns a high-dimensional input into a latent low-dimensional code (encoder), and then performs a reconstruction of the input with this latent code (the decoder).\r\n\r\nImage: [Michael Massi](https://en.wikipedia.org/wiki/Autoencoder#/media/File:Autoencoder_schema.png)",
  "title": "Reducing the Dimensionality of Data with Neural Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "One-Shot Aggregation",
  "full_name": "One-Shot Aggregation",
  "description": "**One-Shot Aggregation** is an image model block that is an alternative to [Dense Blocks](https://paperswithcode.com/method/dense-block), by aggregating intermediate features. It is proposed as part of the [VoVNet](https://paperswithcode.com/method/vovnet) architecture. Each [convolution](https://paperswithcode.com/method/convolution) layer is connected by two-way connection. One way is connected to the subsequent layer to produce the feature with a larger receptive field while the other way is aggregated only once into the final output feature map. The difference with [DenseNet](https://paperswithcode.com/method/densenet) is that the output of each layer is not routed to all subsequent intermediate layers which makes the input size of intermediate layers constant.",
  "title": "An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "IRN",
  "full_name": "Invertible Rescaling Network",
  "description": "An **Invertible Rescaling Network (IRN)** is a network for image rescaling.  According to the Nyquist-Shannon sampling theorem, high-frequency contents are lost during downscaling. Ideally, we hope to keep all lost information to perfectly recover the original HR image, but storing or transferring the high-frequency information is unacceptable. In order to well address this challenge, the Invertible Rescaling Net (IRN) captures some knowledge on the lost information in the form of its distribution and embeds it into model’s parameters to mitigate the ill-posedness. Given an HR image $x$, IRN not only downscales it into a LR image y, but also embeds the case-specific high-frequency content into an auxiliary case-agnostic latent variable $z$, whose marginal distribution\r\nobeys a fixed pre-specified distribution (e.g., isotropic Gaussian). Based on this model,\r\nwe use a randomly drawn sample of $z$ from the pre-specified distribution for the inverse upscaling procedure, which holds the most information that one could have in upscaling.",
  "title": "Invertible Image Rescaling",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "EWC",
  "full_name": "Elastic Weight Consolidation",
  "description": "The methon to overcome catastrophic forgetting in neural network while continual learning",
  "title": "Overcoming catastrophic forgetting in neural networks",
  "collection": "Active Learning",
  "area": "General"
}
{
  "name": "SCN",
  "full_name": "Self-Cure Network",
  "description": "**Self-Cure Network**, or **SCN**, is a method for suppressing uncertainties for large-scale facial expression recognition, prventing deep networks from overfitting uncertain facial images. Specifically, SCN suppresses the uncertainty from two different aspects: 1) a self-attention mechanism over mini-batch to weight each training sample with a ranking regularization, and 2) a careful relabeling mechanism to modify the labels of these samples in the lowest-ranked group.",
  "title": "Suppressing Uncertainties for Large-Scale Facial Expression Recognition",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Switch FFN",
  "full_name": "Switch FFN",
  "description": "A **Switch FFN** is a sparse layer that operates independently on tokens within an input sequence. It is shown in the blue block in the figure. We diagram two tokens ($x\\_{1}$ = “More” and $x\\_{2}$ = “Parameters” below) being routed (solid lines) across four FFN experts, where the router independently routes each token. The switch FFN layer returns the output of the selected FFN multiplied by the router gate value (dotted-line).",
  "title": "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "Z-PNN",
  "full_name": "Pansharpening by convolutional neural networks in the full resolution framework",
  "description": "In recent years, there has been a growing interest on deep learning-based pansharpening.\r\nResearch has mainly focused on architectures.\r\nHowever, lacking a ground truth, model training is also a major issue.\r\nA popular approach is to train networks in a reduced resolution domain, using the original data as ground truths.\r\nThe trained networks are then used on full resolution data, relying on an implicit scale invariance hypothesis.\r\nResults are generally good at reduced resolution, but more questionable at full resolution.\r\n\r\nHere, we propose a full-resolution training framework for deep learning-based pansharpening.\r\nTraining takes place in the high resolution domain, relying only on the original data, with no loss of information.\r\nTo ensure spectral and spatial fidelity, suitable losses are defined,\r\nwhich force the pansharpened output to be consistent with the available panchromatic and multispectral input.\r\nExperiments carried out on WorldView-3, WorldView-2, and GeoEye-1 images show that methods trained with the proposed framework\r\nguarantee an excellent performance in terms of both full-resolution numerical indexes and visual quality.\r\nThe framework is fully general, and can be used to train and fine-tune any deep learning-based pansharpening network.",
  "title": "Pansharpening by convolutional neural networks in the full resolution framework",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Inception-ResNet-v2 Reduction-B",
  "full_name": "Inception-ResNet-v2 Reduction-B",
  "description": "**Inception-ResNet-v2 Reduction-B** is an image model block used in the [Inception-ResNet-v2](https://paperswithcode.com/method/inception-resnet-v2) architecture.",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "ResNet",
  "full_name": "Residual Network",
  "description": "**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. Instead of hoping each few stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. They stack [residual blocks](https://paperswithcode.com/method/residual-block) ontop of each other to form network: e.g. a ResNet-50 has fifty layers using these blocks. \r\n\r\nFormally, denoting the desired underlying mapping as $\\mathcal{H}(x)$, we let the stacked nonlinear layers fit another mapping of $\\mathcal{F}(x):=\\mathcal{H}(x)-x$. The original mapping is recast into $\\mathcal{F}(x)+x$.\r\n\r\nThere is empirical evidence that these types of network are easier to optimize, and can gain accuracy from considerably increased depth.",
  "title": "Deep Residual Learning for Image Recognition",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "RMSProp",
  "full_name": "RMSProp",
  "description": "**RMSProp** is an unpublished adaptive learning rate optimizer [proposed by Geoff Hinton](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). The motivation is that the magnitude of gradients can differ for different weights, and can change during learning, making it hard to choose a single global learning rate. RMSProp tackles this by keeping a moving average of the squared gradient and adjusting the weight updates by this magnitude. The gradient updates are performed as:\r\n\r\n$$E\\left[g^{2}\\right]\\_{t} = \\gamma E\\left[g^{2}\\right]\\_{t-1} + \\left(1 - \\gamma\\right) g^{2}\\_{t}$$\r\n\r\n$$\\theta\\_{t+1} = \\theta\\_{t} - \\frac{\\eta}{\\sqrt{E\\left[g^{2}\\right]\\_{t} + \\epsilon}}g\\_{t}$$\r\n\r\nHinton suggests $\\gamma=0.9$, with a good default for $\\eta$ as $0.001$.\r\n\r\nImage: [Alec Radford](https://twitter.com/alecrad)",
  "title": null,
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Stochastic Weight Averaging",
  "full_name": "Stochastic Weight Averaging",
  "description": "**Stochastic Weight Averaging** is an optimization procedure that averages multiple points along the trajectory of [SGD](https://paperswithcode.com/method/sgd), with a cyclical or constant learning rate. On the one hand it averages weights, but it also has the property that, with a cyclical or constant learning rate, SGD proposals are approximately sampling from the loss surface of the network, leading to stochastic weights and helping to discover broader optima.",
  "title": "Averaging Weights Leads to Wider Optima and Better Generalization",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Pythia",
  "full_name": "Pythia",
  "description": "**Pythia** is a suite of decoder-only autoregressive language models all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. The model architecture and hyperparameters largely follow GPT-3, with a few notable deviations based on recent advances in best practices for large scale language modeling.",
  "title": "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "GCU",
  "full_name": "Growing Cosine Unit",
  "description": "An oscillatory function defined as $x \\cdot cos(x)$ that reports better performance than Sigmoid, Mish, Swish, and ReLU on several benchmarks.",
  "title": "Growing Cosine Unit: A Novel Oscillatory Activation Function That Can Speedup Training and Reduce Parameters in Convolutional Neural Networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Weights Reset",
  "full_name": "Weights Reset",
  "description": "Weight Reset is an implicit regularization procedure that periodically resets a randomly selected portion of layer weights during the training process, according to predefined probability distributions.\r\n\r\nTo delineate the Weight Reset procedure, a straightforward formulation is introduced. Assume $\\mathcal{B}(p)$ as a multivariate Bernoulli distribution with parameter $p$, and let's propose that $\\mathcal{D}$ is an arbitrary distribution used for initializing model weights. At specified intervals (after a certain number of training iterations, except for the last one), a random portion of the weights $W={w^l}$ from selected layers in the neural network undergoes a reset utilizing the following method:\r\n$$\r\n\\tilde{w}^l = w^l\\cdot (1-m) + \\xi\\cdot m,\r\n$$\r\nwhere $\\cdot$ operation is an element-wise hadamar type multiplication, $w^l$ are current weights for layer $l$, $\\tilde{w}^l$ are reset weights for this layer, $m \\sim \\mathcal{B}(p^l)$ is a resetting mask, $p^l$ is a resetting rate for a layer $l$, $\\xi \\sim \\mathcal{D}$ are new random weights.\r\n\r\nEvidence has indicated that Weight Reset can compete with, and in some instances, surpass traditional regularization techniques.\r\n\r\nGiven the observable effects of the Weight Reset technique on an increasing number of weights in a model, there's a plausible hypothesis suggesting its potential association with the Double Descent phenomenon.",
  "title": "The Weights Reset Technique for Deep Neural Networks Implicit Regularization",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "YellowFin",
  "full_name": "YellowFin",
  "description": "**YellowFin** is a learning rate and momentum tuner motivated by robustness properties and analysis of quadratic objectives. It stems from a known but obscure fact: the momentum operator's spectral radius is constant in a large subset of the hyperparameter space. For quadratic objectives, the optimizer tunes both the learning rate and the momentum to keep the hyperparameters within a region in which the convergence rate is a constant rate equal to the root momentum. This notion is extended empirically to non-convex objectives. On every iteration, YellowFin optimizes the hyperparameters to minimize a local quadratic optimization.",
  "title": "YellowFin and the Art of Momentum Tuning",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "SWRNet",
  "full_name": "A Deep Learning Approach for Small Surface Water Area Recognition Onboard Satellite",
  "description": "",
  "title": "SWRNet: A Deep Learning Approach for Small Surface Water Area Recognition Onboard Satellite",
  "collection": "Image Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Global and Sliding Window Attention",
  "full_name": "Global and Sliding Window Attention",
  "description": "**Global and Sliding Window Attention** is an attention pattern for attention-based models. It is motivated by the fact that non-sparse attention in the original [Transformer](https://paperswithcode.com/method/transformer) formulation has a [self-attention component](https://paperswithcode.com/method/scaled) with $O\\left(n^{2}\\right)$ time and memory complexity where $n$ is the input sequence length and thus, is not efficient to scale to long inputs. \r\n\r\nSince [windowed](https://paperswithcode.com/method/sliding-window-attention) and [dilated](https://paperswithcode.com/method/dilated-sliding-window-attention) attention patterns are not flexible enough to learn task-specific representations, the authors of the [Longformer](https://paperswithcode.com/method/longformer) add “global attention” on few pre-selected input locations. This attention is operation symmetric: that is, a token with a global attention attends to all tokens across the sequence, and all tokens in the sequence attend to it. The Figure to the right shows an example of a sliding window attention with global attention at a few tokens at custom locations. For the example of classification, global attention is used for the [CLS] token, while in the example of Question Answering, global attention is provided on all question tokens.",
  "title": "Longformer: The Long-Document Transformer",
  "collection": "Attention Patterns",
  "area": "Natural Language Processing"
}
{
  "name": "WaveRNN",
  "full_name": "WaveRNN",
  "description": "**WaveRNN** is a single-layer recurrent neural network for audio generation that is designed efficiently predict 16-bit raw audio samples.\r\n\r\nThe overall computation in the WaveRNN is as follows (biases omitted for brevity):\r\n\r\n$$ \\mathbf{x}\\_{t} = \\left[\\mathbf{c}\\_{t−1},\\mathbf{f}\\_{t−1}, \\mathbf{c}\\_{t}\\right] $$\r\n\r\n$$ \\mathbf{u}\\_{t} = \\sigma\\left(\\mathbf{R}\\_{u}\\mathbf{h}\\_{t-1} + \\mathbf{I}^{*}\\_{u}\\mathbf{x}\\_{t}\\right) $$\r\n\r\n$$ \\mathbf{r}\\_{t} = \\sigma\\left(\\mathbf{R}\\_{r}\\mathbf{h}\\_{t-1} + \\mathbf{I}^{*}\\_{r}\\mathbf{x}\\_{t}\\right) $$\r\n\r\n$$ \\mathbf{e}\\_{t} = \\tau\\left(\\mathbf{r}\\_{t} \\odot \\left(\\mathbf{R}\\_{e}\\mathbf{h}\\_{t-1}\\right) + \\mathbf{I}^{*}\\_{e}\\mathbf{x}\\_{t} \\right) $$\r\n\r\n$$ \\mathbf{h}\\_{t} = \\mathbf{u}\\_{t} \\cdot \\mathbf{h}\\_{t-1} + \\left(1-\\mathbf{u}\\_{t}\\right) \\cdot \\mathbf{e}\\_{t} $$\r\n\r\n$$ \\mathbf{y}\\_{c}, \\mathbf{y}\\_{f} = \\text{split}\\left(\\mathbf{h}\\_{t}\\right) $$\r\n\r\n$$ P\\left(\\mathbf{c}\\_{t}\\right) = \\text{softmax}\\left(\\mathbf{O}\\_{2}\\text{relu}\\left(\\mathbf{O}\\_{1}\\mathbf{y}\\_{c}\\right)\\right) $$\r\n\r\n$$ P\\left(\\mathbf{f}\\_{t}\\right) = \\text{softmax}\\left(\\mathbf{O}\\_{4}\\text{relu}\\left(\\mathbf{O}\\_{3}\\mathbf{y}\\_{f}\\right)\\right) $$\r\n\r\nwhere the $*$ indicates a masked matrix whereby the last coarse input $\\mathbf{c}\\_{t}$ is only connected to the fine part of the states $\\mathbf{u}\\_{t}$, $\\mathbf{r}\\_{t}$, $\\mathbf{e}\\_{t}$ and $\\mathbf{h}\\_{t}$ and thus only affects the fine output $\\mathbf{y}\\_{f}$. The coarse and fine parts $\\mathbf{c}\\_{t}$ and $\\mathbf{f}\\_{t}$ are encoded as scalars in $\\left[0, 255\\right]$ and scaled to the interval $\\left[−1, 1\\right]$. 
The matrix $\\mathbf{R}$ formed from the matrices $\\mathbf{R}\\_{u}$, $\\mathbf{R}\\_{r}$, $\\mathbf{R}\\_{e}$ is computed as a single matrix-vector product to produce the contributions to all three gates $\\mathbf{u}\\_{t}$, $mathbf{r}\\_{t}$ and $\\mathbf{e}\\_{t}$ (a variant of the [GRU cell](https://paperswithcode.com/method/gru). $\\sigma$ and $\\tau$ are the standard sigmoid and tanh non-linearities.\r\n\r\nEach part feeds into a [softmax](https://paperswithcode.com/method/softmax) layer over the corresponding 8 bits and the prediction of the 8 fine bits is conditioned on the 8 coarse bits. The resulting Dual Softmax layer allows for efficient prediction of 16-bit samples using two small output spaces (2 8 values each) instead of a single large output space (with 2 16 values).",
  "title": "Efficient Neural Audio Synthesis",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "U-Net",
  "full_name": "U-Net",
  "description": "**U-Net** is an architecture for semantic segmentation. It consists of a contracting path and an expansive path. The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit ([ReLU](https://paperswithcode.com/method/relu)) and a 2x2 [max pooling](https://paperswithcode.com/method/max-pooling) operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 [convolution](https://paperswithcode.com/method/convolution) (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer a [1x1 convolution](https://paperswithcode.com/method/1x1-convolution) is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.\r\n\r\n[Original MATLAB Code](https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/u-net-release-2015-10-02.tar.gz)",
  "title": "U-Net: Convolutional Networks for Biomedical Image Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "DRPNN",
  "full_name": "Deep Residual Pansharpening Neural Network",
  "description": "In the field of fusing multi-spectral and panchromatic images (Pan-sharpening), the impressive effectiveness of deep neural networks has been recently employed to overcome the drawbacks of traditional linear models and boost the fusing accuracy. However, to the best of our knowledge, existing research works are mainly based on simple and flat networks with relatively shallow architecture, which severely limited their performances. In this paper, the concept of residual learning has been introduced to form a very deep convolutional neural network to make a full use of the high non-linearity of deep learning models. By both quantitative and visual assessments on a large number of high quality multi-spectral images from various sources, it has been supported that our proposed model is superior to all mainstream algorithms included in the comparison, and achieved the highest spatial-spectral unified accuracy.",
  "title": "Boosting the accuracy of multi-spectral image pan-sharpening by learning a deep residual network",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "MobileNetV1",
  "full_name": "MobileNetV1",
  "description": "**MobileNet** is a type of convolutional neural network designed for mobile and embedded vision applications. They are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks that can have low latency for mobile and embedded devices.",
  "title": "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Smish",
  "full_name": "Smish",
  "description": "Smish is an activation function defined as $f(x)=x\\cdot \\text{tanh}(\\ln(1+\\sigma(x)))$ where $\\sigma(x)$ denotes the sigmoid function. A parameterized version was also described in the form $f(x)=\\alpha x\\cdot \\text{tanh}(\\ln(1+\\sigma(\\beta x)))$.\r\n\r\nPaper: Smish: A Novel Activation Function for Deep Learning Methods\r\n\r\nSource: https://www.mdpi.com/2079-9292/11/4/540",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "SimCLR",
  "full_name": "SimCLR",
  "description": "**SimCLR** is a framework for contrastive learning of visual representations. It learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space. It consists of:\r\n\r\n- A stochastic data augmentation module that transforms any given data example randomly resulting in two correlated views of the same example, denoted $\\mathbf{\\tilde{x}\\_{i}}$ and $\\mathbf{\\tilde{x}\\_{j}}$, which is considered a positive pair. SimCLR sequentially applies three simple augmentations: random cropping followed by resize back to the original size, random color distortions, and [random Gaussian blur](https://paperswithcode.com/method/random-gaussian-blur). The authors find random crop and color distortion is crucial to achieve good performance.\r\n\r\n- A neural network base encoder $f\\left(·\\right)$ that extracts representation vectors from augmented data examples. The framework allows various choices of the network architecture without any constraints. The authors opt for simplicity and adopt [ResNet](https://paperswithcode.com/method/resnet) to obtain $h\\_{i} = f\\left(\\mathbf{\\tilde{x}}\\_{i}\\right) = \\text{ResNet}\\left(\\mathbf{\\tilde{x}}\\_{i}\\right)$ where $h\\_{i} \\in \\mathbb{R}^{d}$ is the output after the [average pooling](https://paperswithcode.com/method/average-pooling) layer.\r\n\r\n- A small neural network projection head $g\\left(·\\right)$ that maps representations to the space where contrastive loss is applied. Authors use a MLP with one hidden layer to obtain $z\\_{i} = g\\left(h\\_{i}\\right) = W^{(2)}\\sigma\\left(W^{(1)}h\\_{i}\\right)$ where $\\sigma$ is a [ReLU](https://paperswithcode.com/method/relu) nonlinearity. The authors find it beneficial to define the contrastive loss on $z\\_{i}$’s rather than $h\\_{i}$’s.\r\n\r\n- A contrastive loss function defined for a contrastive prediction task. 
Given a set {$\\mathbf{\\tilde{x}}\\_{k}$} including a positive pair of examples $\\mathbf{\\tilde{x}}\\_{i}$ and $\\mathbf{\\tilde{x}\\_{j}}$ , the contrastive prediction task aims to identify $\\mathbf{\\tilde{x}}\\_{j}$ in {$\\mathbf{\\tilde{x}}\\_{k}$}$\\_{k\\neq{i}}$ for a given $\\mathbf{\\tilde{x}}\\_{i}$.\r\n\r\nA minibatch of $N$ examples is randomly sampled and the contrastive prediction task is defined on pairs of augmented examples derived from the minibatch, resulting in $2N$ data points. Negative examples are not sampled explicitly. Instead, given a positive pair, the other $2(N − 1)$ augmented examples within a minibatch are treated as negative examples. A [NT-Xent](https://paperswithcode.com/method/nt-xent) (the normalized\r\ntemperature-scaled cross entropy loss) loss function is used (see components).",
  "title": "A Simple Framework for Contrastive Learning of Visual Representations",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Random Horizontal Flip",
  "full_name": "Random Horizontal Flip",
  "description": "**RandomHorizontalFlip** is a type of image data augmentation which horizontally flips a given image with a given probability.\r\n\r\nImage Credit: [Apache MXNet](https://mxnet.apache.org/versions/1.5.0/tutorials/gluon/data_augmentation.html)",
  "title": null,
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "MetaFormer",
  "full_name": "MetaFormer",
  "description": "MetaFormer is a general architecture abstracted from Transformers by not specifying the token mixer.",
  "title": "MetaFormer Is Actually What You Need for Vision",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "HANet",
  "full_name": "Height-driven Attention Network",
  "description": "**Height-driven Attention Network**, or **HANet**, is a general add-on module for improving semantic segmentation for urban-scene images. It emphasizes informative features or classes selectively according to the vertical position of a pixel. The pixel-wise class distributions are significantly different from each other among horizontally segmented sections in the urban-scene images. Likewise, urban-scene images have their own distinct characteristics, but most semantic segmentation networks do not reflect such unique attributes in the architecture. The proposed network architecture incorporates the capability exploiting the attributes to handle the urban scene dataset effectively.",
  "title": "Cars Can't Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention Networks",
  "collection": "Image Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Balanced Feature Pyramid",
  "full_name": "Balanced Feature Pyramid",
  "description": "**Balanced Feature Pyramid** is a feature pyramid module. It differs from approaches like [FPNs](https://paperswithcode.com/method/fpn) that integrate multi-level features using lateral connections. Instead the BFP strengthens the multi-level features using the same deeply integrated balanced semantic features. The pipeline is shown in the Figure to the right. It consists of four steps, rescaling, integrating, refining and strengthening.\r\n\r\nFeatures at resolution level $l$ are denoted as $C\\_{l}$. The number of multi-level features is denoted as $L$. The indexes of involved lowest and highest levels are denoted as $l\\_{min}$ and $l\\_{max}$. In the Figure, $C\\_{2}$ has the highest resolution. To integrate multi-level features and preserve their semantic hierarchy at the same time, we first resize the multi-level features {$C\\_{2}, C\\_{3}, C\\_{4}, C\\_{5}$} to an intermediate size, i.e., the same size as $C\\_{4}$, with interpolation and max-pooling respectively. Once the features are rescaled, the balanced semantic features are obtained by simple averaging as:\r\n\r\n$$ C = \\frac{1}{L}\\sum^{l\\_{max}}\\_{l=l\\_{min}}C\\_{l} $$\r\n\r\nThe obtained features are then rescaled using the same but reverse procedure to strengthen the original features. Each resolution obtains equal information from others in this procedure. Note that this procedure does not contain any parameter. The authors observe improvement with this nonparametric method, proving the effectiveness of the information flow. \r\n\r\nThe balanced semantic features can be further refined to be more discriminative. The authors found both the refinements with convolutions directly and the non-local module work well. But the\r\nnon-local module works in a more stable way. Therefore, embedded Gaussian non-local attention is utilized as default. 
The refining step helps us enhance the integrated features and further improve the results.\r\n\r\nWith this method, features from low-level to high-level are aggregated at the same time. The outputs\r\n{$P\\_{2}, P\\_{3}, P\\_{4}, P\\_{5}$} are used for object detection following the same pipeline in FPN.",
  "title": "Libra R-CNN: Towards Balanced Learning for Object Detection",
  "collection": "Feature Pyramid Blocks",
  "area": "Computer Vision"
}
{
  "name": "Dilated Causal Convolution",
  "full_name": "Dilated Causal Convolution",
  "description": "A **Dilated Causal Convolution** is a [causal convolution](https://paperswithcode.com/method/causal-convolution) where the filter is applied over an area larger than its length by skipping input values with a certain step. A dilated causal [convolution](https://paperswithcode.com/method/convolution) effectively allows the network to have very large receptive fields with just a few layers.",
  "title": "WaveNet: A Generative Model for Raw Audio",
  "collection": "Temporal Convolutions",
  "area": "Sequential"
}
{
  "name": "DSPT",
  "full_name": "double-stage parameter tuning",
  "description": "Parameter tuning method for neural network models with adaptive activation functions.",
  "title": "Adaptive hybrid activation function for deep neural networks",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "AdaShift",
  "full_name": "AdaShift",
  "description": "**AdaShift** is a type of adaptive stochastic optimizer that decorrelates $v\\_{t}$ and $g\\_{t}$ in [Adam](https://paperswithcode.com/method/adam) by temporal shifting, i.e., using temporally shifted gradient $g\\_{t−n}$ to calculate $v\\_{t}$. The authors argue that an inappropriate correlation between gradient $g\\_{t}$ and the second-moment term $v\\_{t}$ exists in Adam, which results in a large gradient being likely to have a small step size while a small gradient may have a large step size. The authors argue that such biased step sizes are the fundamental cause of non-convergence of Adam.\r\n\r\nThe AdaShift updates, based on the idea of temporal independence between gradients, are as follows:\r\n\r\n$$ g\\_{t} = \\nabla{f\\_{t}}\\left(\\theta\\_{t}\\right) $$\r\n\r\n$$ m\\_{t} = \\sum^{n-1}\\_{i=0}\\beta^{i}\\_{1}g\\_{t-i}/\\sum^{n-1}\\_{i=0}\\beta^{i}\\_{1} $$\r\n\r\nThen for $i=1$ to $M$:\r\n\r\n$$ v\\_{t}\\left[i\\right] = \\beta\\_{2}v\\_{t-1}\\left[i\\right] + \\left(1-\\beta\\_{2}\\right)\\phi\\left(g^{2}\\_{t-n}\\left[i\\right]\\right) $$\r\n\r\n$$ \\theta\\_{t}\\left[i\\right] = \\theta\\_{t-1}\\left[i\\right] - \\alpha\\_{t}/\\sqrt{v\\_{t}\\left[i\\right]}\\cdot{m\\_{t}\\left[i\\right]} $$",
  "title": "AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "SFAM",
  "full_name": "Scale-wise Feature Aggregation Module",
  "description": "**SFAM**, or **Scale-wise Feature Aggregation Module**, is a feature extraction block from the [M2Det](https://paperswithcode.com/method/m2det) architecture. It aims to aggregate the multi-level multi-scale features generated by [Thinned U-Shaped Modules](https://paperswithcode.com/method/tum) into a multi-level feature pyramid. \r\n\r\nThe first stage of SFAM is to concatenate features of the equivalent scale together along the channel dimension. The aggregated feature pyramid can be presented as $\\mathbf{X} =[\\mathbf{X}\\_1,\\mathbf{X}\\_2,\\dots,\\mathbf{X}\\_i]$, where $\\mathbf{X}\\_i = \\text{Concat}(\\mathbf{x}\\_i^1,\\mathbf{x}\\_i^2,\\dots,\\mathbf{x}\\_i^L) \\in \\mathbb{R}^{W\\_{i}\\times H\\_{i}\\times C}$ refers to the features of the $i$-th largest scale. Here, each scale in the aggregated pyramid contains features from multi-level depths. \r\n\r\nHowever, simple concatenation operations are not adaptive enough. In the second stage, we introduce a channel-wise attention module to encourage features to focus on channels that they benefit most. Following Squeeze-and-Excitation, we use [global average pooling](https://paperswithcode.com/method/global-average-pooling) to generate channel-wise statistics $\\mathbf{z} \\in \\mathbb{R}^C$ at the squeeze step. And to fully capture channel-wise dependencies, the following excitation step learns the attention mechanism via two fully connected layers:\r\n\r\n$$\r\n\\mathbf{s} = \\mathbf{F}\\_{ex}(\\mathbf{z},\\mathbf{W}) = \\sigma(\\mathbf{W}\\_{2} \\delta(\\mathbf{W}\\_{1}\\mathbf{z})),\r\n$$\r\n\r\nwhere $\\sigma$ refers to the [ReLU](https://paperswithcode.com/method/relu) function, $\\delta$ refers to the sigmoid function, $\\mathbf{W}\\_{1} \\in \\mathbb{R}^{\\frac{C}{r}\\times C}$ , $\\mathbf{W}\\_{2} \\in \\mathbb{R}^{C\\times \\frac{C}{r}}$, r is the reduction ratio ($r=16$ in our experiments). 
The final output is obtained by reweighting the input $\\mathbf{X}$ with activation $\\mathbf{s}$:\r\n\r\n$$\r\n\\tilde{\\mathbf{X}}_i^c = \\mathbf{F}\\_{scale}(\\mathbf{X}\\_i^c,s_c) = s_c \\cdot \\mathbf{X}_i^c,\r\n$$\r\n\r\nwhere $\\tilde{\\mathbf{X}\\_i} = [\\tilde{\\mathbf{X}}\\_i^1,\\tilde{\\mathbf{X}}\\_i^2,...,\\tilde{\\mathbf{X}}\\_i^C]$, each of the features is enhanced or weakened by the rescaling operation.",
  "title": "M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "ERU",
  "full_name": "Efficient Recurrent Unit",
  "description": "An **Efficient Recurrent Unit (ERU)** extends [LSTM](https://paperswithcode.com/method/mrnn)-based language models by replacing linear transforms for processing the input vector with the [EESP](https://paperswithcode.com/method/eesp) unit inside the [LSTM](https://paperswithcode.com/method/lstm) cell.",
  "title": "ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Electric",
  "full_name": "Electric",
  "description": "**Electric** is an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context.\r\n\r\nSpecifically, like BERT, Electric also models $p\\_{\\text {data }}\\left(x\\_{t} \\mid \\mathbf{x}\\_{\\backslash t}\\right)$, but does not use masking or a softmax layer. Electric first maps the unmasked input $\\mathbf{x}=\\left[x\\_{1}, \\ldots, x\\_{n}\\right]$ into contextualized vector representations $\\mathbf{h}(\\mathbf{x})=\\left[\\mathbf{h}\\_{1}, \\ldots, \\mathbf{h}\\_{n}\\right]$ using a transformer network. The model assigns a given position $t$ an energy score\r\n\r\n$$\r\nE(\\mathbf{x})\\_{t}=\\mathbf{w}^{T} \\mathbf{h}(\\mathbf{x})\\_{t}\r\n$$\r\n\r\nusing a learned weight vector $w$. The energy function defines a distribution over the possible tokens at position $t$ as\r\n\r\n$$\r\np\\_{\\theta}\\left(x\\_{t} \\mid \\mathbf{x}_{\\backslash t}\\right)=\\exp \\left(-E(\\mathbf{x})\\_{t}\\right) / Z\\left(\\mathbf{x}\\_{\\backslash t}\\right) \r\n$$\r\n\r\n$$\r\n=\\frac{\\exp \\left(-E(\\mathbf{x})\\_{t}\\right)}{\\sum\\_{x^{\\prime} \\in \\mathcal{V}} \\exp \\left(-E\\left(\\operatorname{REPLACE}\\left(\\mathbf{x}, t, x^{\\prime}\\right)\\right)\\_{t}\\right)}\r\n$$\r\n\r\nwhere $\\text{REPLACE}\\left(\\mathbf{x}, t, x^{\\prime}\\right)$ denotes replacing the token at position $t$ with $x^{\\prime}$ and $\\mathcal{V}$ is the vocabulary, in practice usually word pieces. Unlike with BERT, which produces the probabilities for all possible tokens $x^{\\prime}$ using a softmax layer, a candidate $x^{\\prime}$ is passed in as input to the transformer. 
As a result, computing $p_{\\theta}$ is prohibitively expensive because the partition function $Z\\_{\\theta}\\left(\\mathbf{x}\\_{\\backslash t}\\right)$ requires running the transformer $|\\mathcal{V}|$ times; unlike most EBMs, the intractability of $Z\\_{\\theta}(\\mathbf{x} \\backslash t)$ is more due to the expensive scoring function rather than having a large sample space.",
  "title": "Pre-Training Transformers as Energy-Based Cloze Models",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "ReZero",
  "full_name": "ReZero",
  "description": "**ReZero** is a [normalization](https://paperswithcode.com/methods/category/normalization) approach that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. For each layer,  a [residual connection](https://paperswithcode.com/method/residual-connectio) is introduced for the input signal $x$ and one trainable parameter $\\alpha$ that modulates the non-trivial transformation of a layer $F(\\mathbf{x})$:\r\n\r\n$$\r\n\\mathbf{x}\\_{i+1}=\\mathbf{x}\\_{i}+\\alpha_{i} F\\left(\\mathbf{x}\\_{i}\\right)\r\n$$\r\n\r\nwhere $\\alpha=0$ at the beginning of training. Initially the gradients for all parameters defining $F$ vanish, but dynamically evolve to suitable values during initial stages of training. The architecture is illustrated in the Figure.",
  "title": "ReZero is All You Need: Fast Convergence at Large Depth",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "SRS",
  "full_name": "Sticker Response Selector",
  "description": "**Sticker Response Selector**, or **SRS**, is a model for multi-turn dialog that automatically selects a sticker response. SRS first employs a convolutional based sticker image encoder and a self-attention based multi-turn dialog encoder to obtain the representation of stickers and utterances. Next, deep interaction network is proposed to conduct deep matching between the sticker with each utterance in the dialog history. SRS then learns the short-term and long-term dependency between all interaction results by a fusion network to output the the final matching score.",
  "title": "Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog",
  "collection": "Conversational Models",
  "area": "Natural Language Processing"
}
{
  "name": "GENet",
  "full_name": "GPU-Efficient Network",
  "description": "**GENets**, or **GPU-Efficient Networks**, are a family of efficient models found through [neural architecture search](https://paperswithcode.com/methods/category/neural-architecture-search). The search occurs over several types of convolutional block, which include [depth-wise convolutions](https://paperswithcode.com/method/depthwise-convolution), [batch normalization](https://paperswithcode.com/method/batch-normalization), [ReLU](https://paperswithcode.com/method/relu), and an [inverted bottleneck](https://paperswithcode.com/method/inverted-residual-block) structure.",
  "title": "Neural Architecture Design for GPU-Efficient Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "MFR",
  "full_name": "Meta Face Recognition",
  "description": "**Meta Face Recognition** (MFR) is a meta-learning face recognition method. MFR synthesizes the source/target domain shift with a meta-optimization objective, which requires the model to learn effective representations not only on synthesized source domains but also on synthesized target domains. Specifically, domain-shift batches are built through a domain-level sampling strategy and back-propagated gradients/meta-gradients are obtained on synthesized source/target domains by optimizing multi-domain distributions. The gradients and meta-gradients are further combined to update the model to improve generalization.",
  "title": "Learning Meta Face Recognition in Unseen Domains",
  "collection": "Face Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "Attention Gate",
  "full_name": "Attention Gate",
  "description": "Attention gate focuses on targeted regions while suppressing feature activations in irrelevant regions.\r\nGiven the input feature map $X$ and the gating signal $G\\in \\mathbb{R}^{C'\\times H\\times W}$ which is collected at a coarse scale and contains contextual information, the attention gate uses additive attention to obtain the gating coefficient. Both the input $X$ and the gating signal are first linearly mapped to an $\\mathbb{R}^{F\\times H\\times W}$ dimensional space, and then the output is squeezed in the channel domain to produce a spatial attention weight map $ S \\in \\mathbb{R}^{1\\times H\\times W}$. The overall process can be written as\r\n\\begin{align}\r\n    S &= \\sigma(\\varphi(\\delta(\\phi_x(X)+\\phi_g(G))))\r\n\\end{align}\r\n\\begin{align}\r\n    Y &= S X\r\n\\end{align}\r\nwhere $\\varphi$, $\\phi_x$ and $\\phi_g$ are linear transformations implemented as $1\\times 1$ convolutions. \r\n\r\nThe attention gate guides the model's attention to important regions while suppressing feature activation in unrelated areas. It substantially enhances the representational power of the model without a significant increase in computing cost or number of model parameters due to its lightweight design. It is general and modular, making it simple to use in various CNN models.",
  "title": "Attention U-Net: Learning Where to Look for the Pancreas",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "AdaDelta",
  "full_name": "AdaDelta",
  "description": "**AdaDelta** is a stochastic optimization technique that provides a per-dimension learning rate for [SGD](https://paperswithcode.com/method/sgd). It is an extension of [Adagrad](https://paperswithcode.com/method/adagrad) that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to a fixed size $w$.\r\n\r\nInstead of inefficiently storing $w$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E\\left[g^{2}\\right]\\_{t}$ at time step $t$ then depends only on the previous average and current gradient:\r\n\r\n$$E\\left[g^{2}\\right]\\_{t} = \\gamma{E}\\left[g^{2}\\right]\\_{t-1} + \\left(1-\\gamma\\right)g^{2}\\_{t}$$\r\n\r\nUsually $\\gamma$ is set to around $0.9$. Rewriting SGD updates in terms of the parameter update vector:\r\n\r\n$$ \\Delta\\theta_{t} = -\\eta\\cdot{g\\_{t}}$$\r\n$$\\theta\\_{t+1}  = \\theta\\_{t} + \\Delta\\theta_{t}$$\r\n\r\nAdaDelta first rescales the step by the running average:\r\n\r\n$$ \\Delta\\theta_{t} = -\\frac{\\eta}{\\sqrt{E\\left[g^{2}\\right]\\_{t} + \\epsilon}}g_{t} $$\r\n\r\nand then replaces $\\eta$ with the root mean square of the previous parameter updates, $\\sqrt{E\\left[\\Delta\\theta^{2}\\right]\\_{t-1} + \\epsilon}$. The main advantage of AdaDelta is therefore that we do not need to set a default learning rate.",
  "title": "ADADELTA: An Adaptive Learning Rate Method",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "CornerNet-Squeeze Hourglass Module",
  "full_name": "CornerNet-Squeeze Hourglass Module",
  "description": "**CornerNet-Squeeze Hourglass Module** is an image model block used in [CornerNet](https://paperswithcode.com/method/cornernet)-Lite that is based on an [hourglass module](https://paperswithcode.com/method/hourglass-module), but uses modified fire modules instead of residual blocks. Other than replacing the residual blocks, further modifications include: reducing the maximum feature map resolution of the hourglass modules by adding one more downsampling layer before the hourglass modules, removing one downsampling layer in each hourglass module, replacing the 3 × 3 filters with 1 × 1 filters in the prediction modules of CornerNet, and finally replacing the nearest neighbor upsampling in the hourglass network with transpose [convolution](https://paperswithcode.com/method/convolution) with a 4 × 4 kernel.",
  "title": "CornerNet-Lite: Efficient Keypoint Based Object Detection",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "OASIS",
  "full_name": "OASIS",
  "description": "OASIS is a [GAN](https://paperswithcode.com/method/gan)-based model to translate semantic label maps into realistic-looking images. The model builds on preceding work such as [Pix2Pix](https://paperswithcode.com/method/pix2pix) and SPADE. OASIS introduces the following innovations:  \r\n\r\n1. The method is not dependent on the perceptual loss, which is commonly used for the semantic image synthesis task. A [VGG](https://paperswithcode.com/method/vgg) network trained on ImageNet is routinely employed as the perceptual loss to strongly improve the synthesis quality. The authors show that this perceptual loss also has negative effects: First, it reduces the diversity of the generated images. Second, it negatively influences the color distribution to be more biased towards ImageNet. OASIS eliminates the dependence on the perceptual loss by changing the common discriminator design: The OASIS discriminator segments an image into one of the real classes or an additional fake class. In doing so, it makes more efficient use of the label maps that the discriminator normally receives. This distinguishes the discriminator from the commonly used encoder-shaped discriminators, which concatenate the label maps to the input image and predict a single score per image. With the more fine-grained supervision through the loss of the OASIS discriminator, the perceptual loss is shown to become unnecessary.\r\n\r\n2. A user can generate a diverse set of images per label map by simply resampling noise. This is achieved by conditioning the [spatially-adaptive denormalization](https://arxiv.org/abs/1903.07291) module in each layer of the GAN generator directly on spatially replicated input noise. A side effect of this conditioning is that at inference time an image can be resampled either globally or locally (either the complete image changes or a restricted region in the image).",
  "title": "You Only Need Adversarial Supervision for Semantic Image Synthesis",
  "collection": "Conditional Image-to-Image Translation Models",
  "area": "Computer Vision"
}
{
  "name": "Path Length Regularization",
  "full_name": "Path Length Regularization",
  "description": "**Path Length Regularization** is a type of regularization for [generative adversarial networks](https://paperswithcode.com/methods/category/generative-adversarial-networks) that encourages good conditioning in the mapping from latent codes to images. The idea is to encourage that a fixed-size step in the latent space $\\mathcal{W}$ results in a non-zero, fixed-magnitude change in the image.\r\n\r\nWe can measure the deviation from this ideal empirically by stepping into random directions in the image space and observing the corresponding $\\mathbf{w}$ gradients. These gradients should have close to an equal length regardless of $\\mathbf{w}$ or the image-space direction, indicating that the mapping from the latent space to image space is well-conditioned.\r\n\r\nAt a single $\\mathbf{w} \\in \\mathcal{W}$ the local metric scaling properties of the generator mapping $g\\left(\\mathbf{w}\\right) : \\mathcal{W} \\rightarrow \\mathcal{Y}$ are captured by the Jacobian matrix $\\mathbf{J}\\_{\\mathbf{w}} = \\partial{g}\\left(\\mathbf{w}\\right)/\\partial{\\mathbf{w}}$. Motivated by the desire to preserve the expected lengths of vectors regardless of the direction, we formulate the regularizer as:\r\n\r\n$$ \\mathbb{E}\\_{\\mathbf{w},\\mathbf{y} \\sim \\mathcal{N}\\left(0, \\mathbf{I}\\right)} \\left(||\\mathbf{J}^{\\mathbf{T}}\\_{\\mathbf{w}}\\mathbf{y}||\\_{2} - a\\right)^{2} $$\r\n\r\nwhere $\\mathbf{y}$ are random images with normally distributed pixel intensities, and $\\mathbf{w} \\sim f\\left(\\mathbf{z}\\right)$, where $\\mathbf{z}$ are normally distributed. \r\n\r\nTo avoid explicit computation of the Jacobian matrix, we use the identity $\\mathbf{J}^{\\mathbf{T}}\\_{\\mathbf{w}}\\mathbf{y} = \\nabla\\_{\\mathbf{w}}\\left(g\\left(\\mathbf{w}\\right) \\cdot \\mathbf{y}\\right)$, which is efficiently computable using standard backpropagation. The constant $a$ is set dynamically during optimization as the long-running exponential moving average of the lengths $||\\mathbf{J}^{\\mathbf{T}}\\_{\\mathbf{w}}\\mathbf{y}||\\_{2}$, allowing the optimization to find a suitable global scale by itself.\r\n\r\nThe authors note that they find that path length regularization leads to more reliable and consistently behaving models, making architecture exploration easier. They also observe that the smoother generator is significantly easier to invert.",
  "title": "Analyzing and Improving the Image Quality of StyleGAN",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Causal Convolution",
  "full_name": "Causal Convolution",
  "description": "**Causal convolutions** are a type of [convolution](https://paperswithcode.com/method/convolution) used for temporal data which ensures the model cannot violate the ordering in which we model the data: the prediction $p(x_{t+1} | x_{1}, \\ldots, x_{t})$ emitted by the model at timestep $t$ cannot depend on any of the future timesteps $x_{t+1}, x_{t+2}, \\ldots, x_{T}$. For images, the equivalent of a causal convolution is a [masked convolution](https://paperswithcode.com/method/masked-convolution) which can be implemented by constructing a mask tensor and doing an element-wise multiplication of this mask with the convolution kernel before applying it. For 1-D data such as audio one can more easily implement this by shifting the output of a normal convolution by a few timesteps.",
  "title": "WaveNet: A Generative Model for Raw Audio",
  "collection": "Temporal Convolutions",
  "area": "Sequential"
}
{
  "name": "GATv2",
  "full_name": "Graph Attention Network v2",
  "description": "The __GATv2__ operator from the [“How Attentive are Graph Attention Networks?”](https://arxiv.org/abs/2105.14491) paper, which fixes the static attention problem of the standard [GAT](https://paperswithcode.com/method/gat) layer: since the linear layers in the standard GAT are applied right after each other, the ranking of attended nodes is unconditioned on the query node. In contrast, in GATv2, every node can attend to any other node.\r\n\r\nGATv2 scoring function:\r\n\r\n$e_{i,j} =\\mathbf{a}^{\\top}\\mathrm{LeakyReLU}\\left(\\mathbf{W}[\\mathbf{h}_i \\, \\Vert \\,\\mathbf{h}_j]\\right)$",
  "title": "How Attentive are Graph Attention Networks?",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "DARTS",
  "full_name": "Differentiable Architecture Search",
  "description": "**Differentiable Architecture Search** (**DARTS**) is a method for efficient architecture search. The search space is made continuous so that the architecture can be optimized with respect to its validation set performance through gradient descent.",
  "title": "DARTS: Differentiable Architecture Search",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "MEUZZ",
  "full_name": "MEUZZ",
  "description": "**MEUZZ** is a machine learning-based hybrid fuzzer which employs supervised machine learning for adaptive and generalizable seed scheduling -- a prominent factor in determining the yields of hybrid fuzzing. MEUZZ determines which new seeds are expected to produce better fuzzing yields based on the knowledge learned from past seed scheduling decisions made on the same or similar programs. MEUZZ's learning is based on a series of features extracted via code reachability and dynamic analysis, which incurs negligible runtime overhead (in microseconds). Moreover, MEUZZ automatically infers the data labels by evaluating the fuzzing performance of each selected seed.",
  "title": "MEUZZ: Smart Seed Scheduling for Hybrid Fuzzing",
  "collection": "Hybrid Optimization",
  "area": "General"
}
{
  "name": "D4PG",
  "full_name": "Distributed Distributional DDPG",
  "description": "**D4PG**, or **Distributed Distributional DDPG**, is a policy gradient algorithm that extends [DDPG](https://paperswithcode.com/method/ddpg). The improvements include distributional updates to the DDPG algorithm, combined with the use of multiple distributed workers all writing into the same replay table. Among the other, simpler changes, the biggest performance gain came from the use of $N$-step returns. The authors found that the use of [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) was less crucial to the overall D4PG algorithm, especially on harder problems.",
  "title": "Distributed Distributional Deterministic Policy Gradients",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "Graph Self-Attention",
  "full_name": "Graph Self-Attention",
  "description": "**Graph Self-Attention (GSA)** is a self-attention module used in the [BP-Transformer](https://paperswithcode.com/method/bp-transformer) architecture, and is based on the [graph attentional layer](https://paperswithcode.com/method/graph-attentional-layer).\r\n\r\nFor a given node $u$, we update its representation according to its neighbour nodes, formulated as $\\mathbf{h}\\_{u} \\leftarrow \\text{GSA}\\left(\\mathcal{G}, \\mathbf{h}^{u}\\right)$.\r\n\r\nLet $\\mathcal{A}\\left(u\\right)$ denote the set of the neighbour nodes of $u$ in $\\mathcal{G}$; then $\\text{GSA}\\left(\\mathcal{G}, \\mathbf{h}^{u}\\right)$ is detailed as follows:\r\n\r\n$$ \\mathbf{A}^{u} = \\text{concat}\\left(\\{\\mathbf{h}\\_{v} | v \\in \\mathcal{A}\\left(u\\right)\\}\\right) $$\r\n\r\n$$ \\mathbf{Q}^{u}\\_{i} = \\mathbf{H}\\_{k}\\mathbf{W}^{Q}\\_{i},\\mathbf{K}\\_{i}^{u} = \\mathbf{A}^{u}\\mathbf{W}^{K}\\_{i},\\mathbf{V}^{u}\\_{i} = \\mathbf{A}^{u}\\mathbf{W}\\_{i}^{V} $$\r\n\r\n$$ \\text{head}^{u}\\_{i} = \\text{softmax}\\left(\\frac{\\mathbf{Q}^{u}\\_{i}\\mathbf{K}\\_{i}^{uT}}{\\sqrt{d}}\\right)\\mathbf{V}\\_{i}^{u} $$\r\n\r\n$$ \\text{GSA}\\left(\\mathcal{G}, \\mathbf{h}^{u}\\right) = \\left[\\text{head}^{u}\\_{1}, \\dots, \\text{head}^{u}\\_{h}\\right]\\mathbf{W}^{O}$$\r\n\r\nwhere $d$ is the dimension of $\\mathbf{h}$, and $\\mathbf{W}^{Q}\\_{i}$, $\\mathbf{W}^{K}\\_{i}$ and $\\mathbf{W}^{V}\\_{i}$ are trainable parameters of the $i$-th attention head.",
  "title": "BP-Transformer: Modelling Long-Range Context via Binary Partitioning",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "CycleGAN",
  "full_name": "CycleGAN",
  "description": "**CycleGAN**, or **Cycle-Consistent GAN**, is a type of generative adversarial network for unpaired image-to-image translation. For two domains $X$ and $Y$, CycleGAN learns a mapping $G : X \\rightarrow Y$ and $F: Y \\rightarrow X$. The novelty lies in trying to enforce the intuition that these mappings should be inverses of each other and that both mappings should be bijections. This is achieved through a [cycle consistency loss](https://paperswithcode.com/method/cycle-consistency-loss) that encourages $F\\left(G\\left(x\\right)\\right) \\approx x$ and $G\\left(F\\left(y\\right)\\right) \\approx y$. Combining this loss with the adversarial losses on $X$ and $Y$ yields the full objective for unpaired image-to-image translation.\r\n\r\nFor the mapping $G : X \\rightarrow Y$ and its discriminator $D\\_{Y}$ we have the objective:\r\n\r\n$$ \\mathcal{L}\\_{GAN}\\left(G, D\\_{Y}, X, Y\\right) =\\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[\\log D\\_{Y}\\left(y\\right)\\right] + \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[\\log\\left(1 - D\\_{Y}\\left(G\\left(x\\right)\\right)\\right)\\right] $$\r\n\r\nwhere $G$ tries to generate images $G\\left(x\\right)$ that look similar to images from domain $Y$, while $D\\_{Y}$ tries to discriminate between translated samples $G\\left(x\\right)$ and real samples $y$. A similar loss is postulated for the mapping $F: Y \\rightarrow X$ and its discriminator $D\\_{X}$.\r\n\r\nThe Cycle Consistency Loss reduces the space of possible mapping functions by enforcing forward and backwards consistency:\r\n\r\n$$ \\mathcal{L}\\_{cyc}\\left(G, F\\right) = \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[||F\\left(G\\left(x\\right)\\right) - x||\\_{1}\\right] + \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[||G\\left(F\\left(y\\right)\\right) - y||\\_{1}\\right] $$\r\n\r\nThe full objective is:\r\n\r\n$$ \\mathcal{L}\\left(G, F, D\\_{X}, D\\_{Y}\\right) = \\mathcal{L}\\_{GAN}\\left(G, D\\_{Y}, X, Y\\right) + \\mathcal{L}\\_{GAN}\\left(F, D\\_{X}, Y, X\\right) + \\lambda\\mathcal{L}\\_{cyc}\\left(G, F\\right) $$\r\n\r\nwhere we aim to solve:\r\n\r\n$$ G^{\\*}, F^{\\*} = \\arg \\min\\_{G, F} \\max\\_{D\\_{X}, D\\_{Y}} \\mathcal{L}\\left(G, F, D\\_{X}, D\\_{Y}\\right) $$\r\n\r\nFor the original architecture the authors use:\r\n\r\n- two stride-2 convolutions, several residual blocks, and two fractionally strided convolutions with stride $\\frac{1}{2}$\r\n- [instance normalization](https://paperswithcode.com/method/instance-normalization)\r\n- PatchGANs for the discriminator\r\n- a least-squares loss for the [GAN](https://paperswithcode.com/method/gan) objectives.",
  "title": "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "MobileNetV2",
  "full_name": "MobileNetV2",
  "description": "**MobileNetV2** is a convolutional neural network architecture that seeks to perform well on mobile devices. It is based on an inverted residual structure where the residual connections are between the bottleneck layers.  The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. As a whole, the architecture of MobileNetV2 contains the initial fully [convolution](https://paperswithcode.com/method/convolution) layer with 32 filters, followed by 19 residual bottleneck layers.",
  "title": "MobileNetV2: Inverted Residuals and Linear Bottlenecks",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "DPN",
  "full_name": "Dual Path Network",
  "description": "A **Dual Path Network (DPN)** is a convolutional neural network which presents a new topology of connection paths internally. The intuition is that [ResNet](https://paperswithcode.com/method/resnet) enables feature re-usage while [DenseNet](https://paperswithcode.com/method/densenet) enables new feature exploration, and both are important for learning good representations. To enjoy the benefits from both path topologies, Dual Path Networks share common features while maintaining the flexibility to explore new features through dual path architectures. \r\n\r\nWe formulate such a dual path architecture as follows:\r\n\r\n$$x^{k} = \\sum\\limits\\_{t=1}^{k-1} f\\_t^{k}(h^t) \\text{,}  $$\r\n\r\n$$\r\ny^{k} = \\sum\\limits\\_{t=1}^{k-1} v\\_t(h^t) = y^{k-1} + \\phi^{k-1}(y^{k-1}) \\text{,}\r\n$$\r\n\r\n$$\r\nr^{k} = x^{k} + y^{k} \\text{,}\r\n$$\r\n\r\n$$\r\nh^k = g^k \\left( r^{k} \\right) \\text{,}\r\n$$\r\n\r\nwhere $x^{k}$ and $y^{k}$ denote the information extracted at the $k$-th step from each individual path, and $v_t(\\cdot)$ is a feature learning function like $f_t^k(\\cdot)$. The first equation refers to the densely connected path that enables exploring new features. The second equation refers to the residual path that enables re-usage of common features. The third equation defines the dual path that integrates them and feeds them to the last transformation function in the last equation.",
  "title": "Dual Path Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "SHA-RNN",
  "full_name": "Single Headed Attention RNN",
  "description": "**SHA-RNN**, or **Single Headed Attention RNN**, is a recurrent neural network, and language model when combined with an embedding input and [softmax](https://paperswithcode.com/method/softmax) classifier, based on a core [LSTM](https://paperswithcode.com/method/lstm) component and a [single-headed attention](https://paperswithcode.com/method/single-headed-attention) module. Other design choices include a Boom feedforward layer and the use of [layer normalization](https://paperswithcode.com/method/layer-normalization). The guiding principles of the author were to ensure simplicity in the architecture and to keep computational costs bounded (the model was originally trained with a single GPU).",
  "title": "Single Headed Attention RNN: Stop Thinking With Your Head",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Social-STGCNN",
  "full_name": "Social-STGCNN",
  "description": "**Social-STGCNN** is a method for human trajectory prediction. Pedestrian trajectories are not only influenced by the pedestrian itself but also by interaction with surrounding objects.",
  "title": "Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction",
  "collection": "Trajectory Prediction Models",
  "area": "Computer Vision"
}
{
  "name": "ACER",
  "full_name": "ACER",
  "description": "**ACER**, or **Actor Critic with Experience Replay**, is an actor-critic deep reinforcement learning agent with [experience replay](https://paperswithcode.com/method/experience-replay). It can be seen as an off-policy extension of [A3C](https://paperswithcode.com/method/a3c), where the off-policy estimator is made feasible by:\r\n\r\n- Using [Retrace](https://paperswithcode.com/method/retrace) Q-value estimation.\r\n- Using truncated importance sampling with bias correction.\r\n- Using a trust region policy optimization method.\r\n- Using a [stochastic dueling network](https://paperswithcode.com/method/stochastic-dueling-network) architecture.",
  "title": "Sample Efficient Actor-Critic with Experience Replay",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "NICE-SLAM",
  "full_name": "NICE-SLAM: Neural Implicit Scalable Encoding for SLAM",
  "description": "NICE-SLAM is a dense RGB-D SLAM system that combines neural implicit decoders with hierarchical grid-based representations and can be applied to large-scale scenes.\r\n\r\nNeural implicit representations have recently shown encouraging results in various domains, including promising progress in simultaneous localization and mapping (SLAM). Nevertheless, existing methods produce over-smoothed scene reconstructions and have difficulty scaling up to large scenes. These limitations are mainly due to their simple fully-connected network architecture that does not incorporate local information in the observations. In this paper, we present NICE-SLAM, a dense SLAM system that incorporates multi-level local information by introducing a hierarchical scene representation. Optimizing this representation with pre-trained geometric priors enables detailed reconstruction on large indoor scenes. Compared to recent neural implicit SLAM systems, our approach is more scalable, efficient, and robust. Experiments on five challenging datasets demonstrate competitive results of NICE-SLAM in both mapping and tracking quality.",
  "title": "NICE-SLAM: Neural Implicit Scalable Encoding for SLAM",
  "collection": "3D Reconstruction",
  "area": "Computer Vision"
}
{
  "name": "GCA",
  "full_name": "Graph Contrastive learning with Adaptive augmentation",
  "description": "",
  "title": "Graph Contrastive Learning with Adaptive Augmentation",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "L1 Regularization",
  "full_name": "L1 Regularization",
  "description": "**$L_{1}$ Regularization** is a regularization technique applied to the weights of a neural network. We minimize a loss function comprising both the primary loss function and a penalty on the $L\\_{1}$ Norm of the weights:\r\n\r\n$$L\\_{new}\\left(w\\right) = L\\_{original}\\left(w\\right) + \\lambda{||w||}\\_{1}$$\r\n\r\nwhere $\\lambda$ is a value determining the strength of the penalty. In contrast to [weight decay](https://paperswithcode.com/method/weight-decay), $L_{1}$ regularization promotes sparsity; i.e. some parameters have an optimal value of zero.\r\n\r\nImage Source: [Wikipedia](https://en.wikipedia.org/wiki/Regularization_(mathematics)#/media/File:Sparsityl1.png)",
  "title": null,
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "modReLU",
  "full_name": "modReLU",
  "description": "**modReLU** is an activation that is a modification of a [ReLU](https://paperswithcode.com/method/relu). It is a pointwise nonlinearity, $\\sigma\\_{modReLU}\\left(z\\right) : \\mathbb{C} \\rightarrow \\mathbb{C}$, which affects only the absolute value of a complex number, defined as:\r\n\r\n$$ \\sigma\\_{modReLU}\\left(z\\right) = \\left(|z| + b\\right)\\frac{z}{|z|} \\text{ if } |z| + b \\geq 0 $$\r\n$$ \\sigma\\_{modReLU}\\left(z\\right) = 0 \\text{ if } |z| + b < 0 $$\r\n\r\nwhere $b \\in \\mathbb{R}$ is a bias parameter of the nonlinearity. For an $n\\_{h}$ dimensional hidden space we learn $n\\_{h}$ nonlinearity bias parameters, one per dimension.",
  "title": "Unitary Evolution Recurrent Neural Networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Mode Normalization",
  "full_name": "Mode Normalization",
  "description": "**Mode Normalization** extends normalization to more than a single mean and variance, allowing for detection of modes of data on-the-fly, jointly normalizing samples that share common features. It first assigns samples in a mini-batch to different modes via a gating network, and then normalizes each sample with estimators for its corresponding mode.",
  "title": "Mode Normalization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "RFP",
  "full_name": "Recursive Feature Pyramid",
  "description": "A **Recursive Feature Pyramid (RFP)** builds on top of the Feature Pyramid Networks ([FPN](https://paperswithcode.com/method/fpn)) by incorporating extra feedback connections from the FPN layers into the bottom-up backbone layers. Unrolling the recursive structure to a sequential implementation, we obtain a backbone for an object detector that looks at the images twice or more. Similar to the cascaded detector heads in [Cascade R-CNN](https://paperswithcode.com/method/cascade-r-cnn) trained with more selective examples, an RFP recursively enhances FPN to generate increasingly powerful representations. Resembling Deeply-Supervised Nets, the feedback connections bring the features that directly receive gradients from the detector heads back to the low levels of the bottom-up backbone to speed up training and boost performance.",
  "title": "DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution",
  "collection": "Feature Pyramid Blocks",
  "area": "Computer Vision"
}
{
  "name": "VERSE",
  "full_name": "VERtex Similarity Embeddings",
  "description": "VERtex Similarity Embeddings (VERSE) is a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a selected vertex-to-vertex similarity measure. VERSE learns such embeddings by training a single-layer neural network.\r\n\r\nSource: [Tsitsulin et al.](https://arxiv.org/pdf/1803.04742v1.pdf)\r\n\r\nImage source: [Tsitsulin et al.](https://arxiv.org/pdf/1803.04742v1.pdf)",
  "title": "VERSE: Versatile Graph Embeddings from Similarity Measures",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "ENIGMA",
  "full_name": "ENIGMA",
  "description": "**ENIGMA** is an evaluation framework for dialog systems based on Pearson and Spearman's rank correlations between the estimated rewards and the true rewards.  ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies for collecting the experience data, which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors.",
  "title": "Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach",
  "collection": "Dialog System Evaluation",
  "area": "Natural Language Processing"
}
{
  "name": "Voxel RoI Pooling",
  "full_name": "Voxel RoI Pooling",
  "description": "**Voxel RoI Pooling** is a RoI feature extractor that extracts RoI features directly from voxel features for further refinement. It starts by dividing a region proposal into $G \\times G \\times G$ regular sub-voxels. The center point is taken as the grid point of the corresponding sub-voxel. Since 3D feature volumes are extremely sparse (non-empty voxels account for $<3 \\%$ of the space), we cannot directly utilize max pooling over features of each sub-voxel. Instead, features are integrated from neighboring voxels into the grid points for feature extraction. Specifically, given a grid point $\\mathbf{g}\\_{i}$, we first exploit voxel query to group a set of neighboring voxels $\\Gamma\\_{i}=\\left\\{\\mathbf{v}\\_{i}^{1}, \\mathbf{v}\\_{i}^{2}, \\cdots, \\mathbf{v}\\_{i}^{K}\\right\\}$. Then, we aggregate the neighboring voxel features with a [PointNet](https://paperswithcode.com/method/pointnet) module as:\r\n\r\n$$\r\n\\mathbf{\\eta}\\_{i}=\\max _{k=1,2, \\cdots, K}\\left\\{\\Psi\\left(\\left[\\mathbf{v}\\_{i}^{k}-\\mathbf{g}\\_{i} ; \\mathbf{\\phi}\\_{i}^{k}\\right]\\right)\\right\\}\r\n$$\r\n\r\nwhere $\\mathbf{v}\\_{i}^{k}-\\mathbf{g}\\_{i}$ represents the relative coordinates, $\\mathbf{\\phi}\\_{i}^{k}$ is the voxel feature of $\\mathbf{v}\\_{i}^{k}$, and $\\Psi(\\cdot)$ indicates an MLP. The [max pooling](https://paperswithcode.com/method/max-pooling) operation $\\max (\\cdot)$ is performed along the channels to obtain the aggregated feature vector $\\eta_{i}$. In particular, Voxel RoI pooling is exploited to extract voxel features from the 3D feature volumes out of the last two stages in the 3D backbone network. For each stage, two Manhattan distance thresholds are set to group voxels with multiple scales. Then, we concatenate the aggregated features pooled from different stages and scales to obtain the RoI features.",
  "title": "Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "ooJpiued",
  "full_name": "ooJpiued",
  "description": "",
  "title": "Dynamic Network Model from Partial Observations",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Smooth Step",
  "full_name": "Smooth Step",
  "description": "",
  "title": "The Tree Ensemble Layer: Differentiability meets Conditional Computation",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Hit-Detector",
  "full_name": "Hit-Detector",
  "description": "**Hit-Detector** is a neural architecture search algorithm that simultaneously searches all components of an object detector in an end-to-end manner. It is a hierarchical approach to mine the proper sub search space from the large volume of operation candidates. It consists of two main procedures. First, given a large search space containing all the operation candidates, we screen out the customized sub search space suitable for each part of the detector with the help of group sparsity regularization. Second, we search the architectures for each part within the corresponding sub search space in a differentiable manner.",
  "title": "Hit-Detector: Hierarchical Trinity Architecture Search for Object Detection",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "PGC-DGCNN",
  "full_name": "PGC-DGCNN",
  "description": "PGC-DGCNN provides a new definition of graph convolutional filter. It generalizes the most commonly adopted filter, adding an hyper-parameter controlling the distance of the considered neighborhood. The model extends graph convolutions, following an intuition derived from the well-known convolutional filters over multi-dimensional tensors. The methods involves a simple, efficient and effective way to introduce a hyper-parameter on graph convolutions that influences the filter size, i.e. its receptive field over the considered graph.\r\n\r\nDescription and image from: [On Filter Size in Graph Convolutional Networks](https://arxiv.org/pdf/1811.10435.pdf)",
  "title": "On Filter Size in Graph Convolutional Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "RealFormer",
  "full_name": "RealFormer",
  "description": "**RealFormer** is a type of [Transformer](https://paperswithcode.com/methods/category/transformers) based on the idea of [residual](https://paperswithcode.com/method/residual-connection) attention. It adds skip edges to the backbone [Transformer](https://paperswithcode.com/method/transformer) to create multiple direct paths, one for each type of attention module. It adds no parameters or hyper-parameters. Specifically, RealFormer uses a Post-[LN](https://paperswithcode.com/method/layer-normalization) style Transformer as backbone and adds skip edges to connect [Multi-Head Attention](https://paperswithcode.com/method/multi-head-attention) modules in adjacent layers.",
  "title": "RealFormer: Transformer Likes Residual Attention",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Zero-padded Shortcut Connection",
  "full_name": "Zero-padded Shortcut Connection",
  "description": "A **Zero-padded Shortcut Connection** is a type of [residual connection](https://paperswithcode.com/method/residual-connection) used in the [PyramidNet](https://paperswithcode.com/method/pyramidnet) architecture. For PyramidNets, identity mapping alone cannot be used for a shortcut because the feature map dimension differs among individual residual units. Therefore, only a zero-padded shortcut or projection shortcut can be used for all the residual units. However,  a projection shortcut can hamper information propagation and lead to optimization problems, especially for very deep networks. On the other hand, the zero-padded shortcut avoids the overfitting problem because no additional parameters exist.",
  "title": "Deep Pyramidal Residual Networks",
  "collection": "Skip Connections",
  "area": "General"
}
{
  "name": "ShuffleNet",
  "full_name": "ShuffleNet",
  "description": "**ShuffleNet** is a convolutional neural network designed specially for mobile devices with very limited computing power. The architecture utilizes two new operations, pointwise group [convolution](https://paperswithcode.com/method/convolution) and [channel shuffle](https://paperswithcode.com/method/channel-shuffle), to reduce computation cost while maintaining accuracy.",
  "title": "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Invertible NxN Convolution",
  "full_name": "Invertible NxN Convolution",
  "description": "",
  "title": "CInC Flow: Characterizable Invertible 3x3 Convolution",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Phase Shuffle",
  "full_name": "Phase Shuffle",
  "description": "**Phase Shuffle** is a technique for removing pitched noise artifacts that come from using transposed convolutions in audio generation models. Phase shuffle is an operation with hyperparameter $n$. It randomly perturbs the phase of each layer’s activations by −$n$ to $n$ samples before input to the next layer.\r\n\r\nIn the original application in [WaveGAN](https://paperswithcode.com/method/wavegan), the authors only apply phase shuffle to the discriminator, as the latent vector already provides the generator a mechanism to manipulate the phase\r\nof a resultant waveform. Intuitively speaking, phase shuffle makes the discriminator’s job more challenging by requiring invariance to the phase of the input waveform.",
  "title": "Adversarial Audio Synthesis",
  "collection": "Audio Artifact Removal",
  "area": "Audio"
}
{
  "name": "Cross-View Training",
  "full_name": "Cross-View Training",
  "description": "**Cross View Training**, or **CVT**, is a semi-supervised algorithm for training distributed word representations that makes use of unlabelled and labelled examples. \r\n\r\nCVT adds $k$ auxiliary prediction modules to the model, a Bi-[LSTM](https://paperswithcode.com/method/lstm) encoder, which are used when learning on unlabeled examples. A prediction module is usually a small neural network (e.g., a hidden layer followed by a [softmax](https://paperswithcode.com/method/softmax) layer). Each one takes as input an intermediate representation $h^j(x_i)$ produced by the model (e.g., the outputs of one of the LSTMs in a Bi-LSTM model). It outputs a distribution over labels $p\\_{j}^{\\theta}\\left(y\\mid{x\\_{i}}\\right)$.\r\n\r\nEach $h^j$ is chosen such that it only uses a part of the input $x_i$; the particular choice can depend on the task and model architecture. The auxiliary prediction modules are only used during training; the test-time prediction come from the primary prediction module that produces $p_\\theta$.",
  "title": "Semi-Supervised Sequence Modeling with Cross-View Training",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "PRNet+",
  "full_name": "PRNet+",
  "description": "**PRNet+** is a multi-task neural network for outdoor position recovery from measurement record (MR) data. PRNet+ develops a feature extraction module to learn common local-, short- and long-term spatio-temporal locality from heterogeneous MR samples, with a convolutional neural network (CNN), long short-term memory cells ([LSTM](https://paperswithcode.com/method/lstm)), and attention mechanisms. Specifically, PRNet+ 1) allows the various-length sequences of MR samples, such that the two components (CNN and LSTM) are able to capture spatial locality from the samples within each MR sequence, 2) exploits two attention mechanisms for the time-interval between neighbouring MR samples, together with the one between neighbouring MR sequences, to capture temporal locality, and 3) incorporates the detected transportation modes and predicted locations of heterogeneous MR data into a joint loss for better result.",
  "title": "Outdoor Position Recovery from HeterogeneousTelco Cellular Data",
  "collection": "Position Recovery Models",
  "area": "General"
}
{
  "name": "FGA",
  "full_name": "Factor Graph Attention",
  "description": "A general multimodal attention unit for any number of modalities. Graphical models inspire it, i.e., it infers several attention beliefs via aggregated interaction messages.",
  "title": "Factor Graph Attention",
  "collection": "Attention",
  "area": "General"
}
{
  "name": "3D Dynamic Scene Graph",
  "full_name": "3D Dynamic Scene Graph",
  "description": "**3D Dynamic Scene Graph**, or **DSG**, is a representation that captures metric and semantic aspects of a dynamic environment. A DSG is a layered graph where nodes represent spatial concepts at different levels of abstraction, and edges represent spatio-temporal relations among nodes.",
  "title": "Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs",
  "collection": "3D Representations",
  "area": "Computer Vision"
}
{
  "name": "MIM",
  "full_name": "Mutual Information Machine/Mask Image Modeling",
  "description": "",
  "title": "MIM: Mutual Information Machine",
  "collection": "Representation Learning",
  "area": "General"
}
{
  "name": "MaxUp",
  "full_name": "MaxUp",
  "description": "**MaxUp** is an adversarial data augmentation technique for improving the generalization performance of machine learning models. The idea is to generate a set of augmented data with some random perturbations or transforms, and minimize the maximum, or worst case loss over the augmented data.  By doing so, we implicitly introduce a smoothness or robustness regularization against the random perturbations, and hence improve the generation performance.  For example, in the case of Gaussian perturbation, MaxUp is asymptotically equivalent to using the gradient norm of the loss as a penalty to encourage smoothness.",
  "title": "MaxUp: A Simple Way to Improve Generalization of Neural Network Training",
  "collection": "Adversarial Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Pointer Network",
  "full_name": "Pointer Network",
  "description": "**Pointer Networks** tackle problems where input and output data are sequential data, but can't be solved by seq2seq type models because discrete categories of output elements depend on the variable input size (and are not decided in advance).\r\n\r\nA Pointer Network learns the conditional  probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. They solve the problem of variable size output dictionaries using [additive attention](https://paperswithcode.com/method/additive-attention). But instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, Pointer Networks use attention as a pointer to select a member of the input sequence as the output. \r\n\r\nPointer-Nets can be used to learn approximate solutions to challenging geometric problems such as finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem.",
  "title": "Pointer Networks",
  "collection": "Sequence To Sequence Models",
  "area": "Sequential"
}
{
  "name": "scSE",
  "full_name": "Spatial and Channel SE Blocks",
  "description": "To aggregate global spatial information,\r\nan SE block applies global pooling to the feature map.\r\nHowever, it ignores pixel-wise spatial information,\r\nwhich is important in dense prediction tasks.\r\nTherefore, Roy et al. proposed\r\nspatial and channel SE blocks (scSE). \r\nLike BAM, spatial SE blocks are used, complementing SE blocks, \r\nto provide spatial attention weights to focus on important regions.\r\n\r\nGiven the input feature map $X$, two parallel modules, spatial SE and channel SE, are applied to feature maps to encode spatial and channel information respectively. The channel SE module is an ordinary SE block, while the spatial SE module adopts $1\\times 1$ convolution for spatial squeezing. The outputs from the two modules are fused. The overall process can be written as\r\n\\begin{align}\r\n    s_c & = \\sigma (W_{2} \\delta (W_{1}\\text{GAP}(X)))\r\n\\end{align}\r\n\\begin{align}\r\n    X_\\text{chn} & = s_c  X \r\n\\end{align}\r\n\\begin{align}\r\n    s_s &= \\sigma(\\text{Conv}^{1\\times 1}(X))\r\n\\end{align}\r\n\\begin{align}\r\n    X_\\text{spa} & = s_s  X\r\n\\end{align}\r\n\\begin{align}\r\n    Y &= f(X_\\text{spa},X_\\text{chn})  \r\n\\end{align}\r\n\r\nwhere $f$ denotes the fusion function, which can be  maximum, addition, multiplication or concatenation. \r\n\r\nThe proposed scSE block combines channel and spatial attention to\r\nenhance features as well as \r\ncapturing pixel-wise spatial information.\r\nSegmentation tasks are greatly benefited as a result.\r\nThe integration of an scSE block in F-CNNs makes a consistent improvement in semantic segmentation at negligible extra cost.",
  "title": "Recalibrating Fully Convolutional Networks with Spatial and Channel 'Squeeze & Excitation' Blocks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Random Search",
  "full_name": "Random Search",
  "description": "**Random Search** replaces the exhaustive enumeration of all combinations by selecting them randomly. This can be simply applied to the discrete setting described above, but also generalizes to continuous and mixed spaces. It can outperform Grid search, especially when only a small number of hyperparameters affects the final performance of the machine learning algorithm. In this case, the optimization problem is said to have a low intrinsic dimensionality. Random Search is also embarrassingly parallel, and additionally allows the inclusion of prior knowledge by specifying the distribution from which to sample.\r\n\r\n\r\nExtracted from [Wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Random_search)\r\n\r\nSource [Paper](https://dl.acm.org/doi/10.5555/2188385.2188395)\r\n\r\nImage Source: [BERGSTRA AND BENGIO](https://dl.acm.org/doi/pdf/10.5555/2188385.2188395)",
  "title": null,
  "collection": "Hyperparameter Search",
  "area": "General"
}
{
  "name": "Pose Contrastive Learning",
  "full_name": "Self-Supervised Cross View Cross Subject Pose Contrastive Learning",
  "description": "Please enter a description about the method here",
  "title": "Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment",
  "collection": "Pose Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "Euclidean Norm Regularization",
  "full_name": "Euclidean Norm Regularization",
  "description": "**Euclidean Norm Regularization** is a regularization step used in [generative adversarial networks](https://paperswithcode.com/methods/category/generative-adversarial-networks), and is typically added to both the generator and discriminator losses:\r\n\r\n$$ R\\_{z} = w\\_{r} \\cdot ||\\Delta{z}||^{2}\\_{2} $$\r\n\r\nwhere the scalar weight $w\\_{r}$ is a parameter.\r\n\r\nImage: [LOGAN](https://paperswithcode.com/method/logan)",
  "title": "Deep Compressed Sensing",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Patch Merger",
  "full_name": "Patch Merger Module",
  "description": "PatchMerger is a module for Vision Transformers that decreases the number of tokens/patches passed onto each individual transformer encoder block whilst maintaining performance and reducing compute. PatchMerger takes linearly transforms an input of shape N patches × D dimensions through a learnable weight matrix of shape M output patches × D. This generates M scores, in which a Softmax function is applied for each score. The resulting output has a shape of M × N, which is multiplied to the original input to get an output of shape M × D.\r\n\r\nMathematically, $$Y = \\text{softmax}({W^T}{X^T})X$$\r\n\r\nImage and formula from: Renggli, C., Pinto, A. S., Houlsby, N., Mustafa, B., Puigcerver, J., & Riquelme, C. (2022). Learning to Merge Tokens in Vision Transformers. arXiv preprint arXiv:2202.12015.",
  "title": "Learning to Merge Tokens in Vision Transformers",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "MixConv",
  "full_name": "Mixed Depthwise Convolution",
  "description": "**MixConv**, or **Mixed Depthwise Convolution**, is a type of [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution) that naturally mixes up multiple kernel sizes in a single [convolution](https://paperswithcode.com/method/convolution). It is based on the insight that depthwise convolution applies a single kernel size to all channels, which MixConv overcomes by combining the benefits of multiple kernel sizes. It does this by partitioning channels into groups and applying a different kernel size to each group.",
  "title": "MixConv: Mixed Depthwise Convolutional Kernels",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "TE2Rules",
  "full_name": "Tree Ensemble to Rules",
  "description": "A method to convert a Tree Ensemble model into a Rule list. This makes the AI model more transparent.",
  "title": "TE2Rules: Explaining Tree Ensembles using Rules",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "Pixel Tracking",
  "full_name": "Pixel Tracking",
  "description": "",
  "title": "What's in the Flow? Exploiting Temporal Motion Cues for Unsupervised Generic Event Boundary Detection",
  "collection": "Heuristic Search Algorithms",
  "area": "Reinforcement Learning"
}
{
  "name": "MeRL",
  "full_name": "Meta Reward Learning",
  "description": "**Meta Reward Learning (MeRL)** is a meta-learning method for the problem of learning from sparse and underspecified rewards. For example, an agent receives a complex input, such as a natural language instruction, and needs to generate a complex response, such as an action sequence, while only receiving binary success-failure feedback. The key insight of MeRL in dealing with underspecified rewards is that spurious trajectories and programs that achieve accidental success are detrimental to the agent's generalization performance. For example, an agent might be able to solve a specific instance of the maze problem above. However, if it learns to perform spurious actions during training, it is likely to fail when provided with unseen instructions. To mitigate this issue, MeRL optimizes a more refined auxiliary reward function, which can differentiate between accidental and purposeful success based on features of action trajectories. The auxiliary reward is optimized by maximizing the trained agent's performance on a hold-out validation set via meta learning.",
  "title": "Learning to Generalize from Sparse and Underspecified Rewards",
  "collection": "Meta-Learning Algorithms",
  "area": "General"
}
{
  "name": "TrOCR",
  "full_name": "TrOCR",
  "description": "**TrOCR** is an end-to-end [Transformer](https://paperswithcode.com/methods/category/transformers)-based OCR model for text recognition with pre-trained CV and NLP models. It leverages the [Transformer](https://paperswithcode.com/method/transformer) architecture for both image understanding and wordpiece-level text generation. It first resizes the input text image into $384 × 384$ and then the image is split into a sequence of 16 patches which are used as the input to image Transformers.  Standard Transformer architecture with the [self-attention mechanism](https://paperswithcode.com/method/scaled) is leveraged on both encoder and decoder parts, where wordpiece units are generated as the recognized text from the input image.",
  "title": "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models",
  "collection": "OCR Models",
  "area": "Computer Vision"
}
{
  "name": "RegionViT",
  "full_name": "RegionViT",
  "description": "**RegionViT** consists of two tokenization processes that convert an image into regional (upper path) and local tokens (lower path). Each tokenization is a convolution with different patch sizes, the patch size of regional tokens is $28^2$ while $4^2$ is used for local tokens with dimensions projected to $C$, which means that one regional token covers $7^2$ local tokens based on the spatial locality, leading to the window size of a local region to $7^2$. At stage 1, two set of tokens are passed through the proposed regional-to-local transformer encoders. However, for the later stages, to balance the computational load and to have feature maps at different resolution, the approach uses a downsampling process to halve the spatial resolution while doubling the channel dimension like CNN on both regional and local tokens before going to the next stage. Finally, at the end of the network, it simply averages the remaining regional tokens as the final embedding for the classification while the detection uses all local tokens at each stage since it provides more fine-grained location information. By having the pyramid structure, the ViT can generate multi-scale features and hence it could be easily extended to more vision applications, e.g., object detection, rather than image classification only.",
  "title": "RegionViT: Regional-to-Local Attention for Vision Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Seq2Edits",
  "full_name": "Seq2Edits",
  "description": "**Seq2Edits** is an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. In this approach, each sequence-to-sequence transduction is represented as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. For text normalization, sentence fusion, sentence splitting & rephrasing, text simplification, and grammatical error correction, the approach improves explainability by associating each edit operation with a human-readable tag.\r\n\r\nRather than generating the target sentence as a series of tokens, the model predicts a sequence of edit operations that, when applied to the source sentence, yields the target sentence. Each edit operates on a span in the source sentence and either copies, deletes, or replaces it with one or more target tokens. Edits are generated auto-regressively from left to right using a modified [Transformer](https://paperswithcode.com/method/transformer) architecture to facilitate learning of long-range dependencies.",
  "title": "Seq2Edits: Sequence Transduction Using Span-level Edit Operations",
  "collection": "Sequence Editing Models",
  "area": "Natural Language Processing"
}
{
  "name": "MPNN",
  "full_name": "Message Passing Neural Network",
  "description": "There are at least eight notable examples of models from the literature that can be described using the **Message Passing Neural Networks** (**MPNN**) framework. For simplicity we describe MPNNs which operate on undirected graphs $G$ with node features $x_{v}$ and edge features $e_{vw}$. It is trivial to extend the formalism to directed multigraphs. The forward pass has two phases, a message passing phase and a readout phase. The message passing phase runs for $T$ time steps and is defined in terms of message functions $M_{t}$ and vertex update functions $U_{t}$. During the message passing phase, hidden states $h_{v}^{t}$ at each node in the graph are updated based on messages $m_{v}^{t+1}$ according to\r\n$$\r\nm_{v}^{t+1} = \\sum_{w \\in N(v)} M_{t}(h_{v}^{t}, h_{w}^{t}, e_{vw})\r\n$$\r\n$$\r\nh_{v}^{t+1} = U_{t}(h_{v}^{t}, m_{v}^{t+1})\r\n$$\r\nwhere in the sum, $N(v)$ denotes the neighbors of $v$ in graph $G$. The readout phase computes a feature vector for the whole graph using some readout function $R$ according to\r\n$$\r\n\\hat{y} = R(\\\\{ h_{v}^{T} | v \\in G \\\\})\r\n$$\r\nThe message functions $M_{t}$, vertex update functions $U_{t}$, and readout function $R$ are all learned differentiable functions. $R$ operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism.",
  "title": "Neural Message Passing for Quantum Chemistry",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "PixelRNN",
  "full_name": "Pixel Recurrent Neural Network",
  "description": "**PixelRNNs** are generative neural networks that sequentially predicts the pixels in an image along the two spatial dimensions. They model the discrete probability of the raw pixel values and encode the complete set of dependencies in the image. Variants include the Row [LSTM](https://paperswithcode.com/method/lstm) and the Diagonal [BiLSTM](https://paperswithcode.com/method/bilstm), that scale more easily to larger datasets. Pixel values are treated as discrete random variables by using a [softmax](https://paperswithcode.com/method/softmax) layer in the conditional distributions. Masked convolutions are employed to allow PixelRNNs to model full dependencies between the color channels.",
  "title": "Pixel Recurrent Neural Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "YOLOv3",
  "full_name": "YOLOv3",
  "description": "**YOLOv3** is a real-time, single-stage object detection model that builds on [YOLOv2](https://paperswithcode.com/method/yolov2) with several improvements. Improvements include the use of a new backbone network, [Darknet-53](https://paperswithcode.com/method/darknet-53) that utilises residual connections, or in the words of the author, \"those newfangled residual network stuff\", as well as some improvements to the bounding box prediction step, and use of three different scales from which to extract features (similar to an [FPN](https://paperswithcode.com/method/fpn)).",
  "title": "YOLOv3: An Incremental Improvement",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "FBNet Block",
  "full_name": "FBNet Block",
  "description": "**FBNet Block** is an image model block used in the [FBNet](https://paperswithcode.com/method/fbnet) architectures discovered through [DNAS](https://paperswithcode.com/method/dnas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). The basic building blocks employed are [depthwise convolutions](https://paperswithcode.com/method/depthwise-convolution) and a [residual connection](https://paperswithcode.com/method/residual-connection).",
  "title": "FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Randomized Smoothing",
  "full_name": "Randomized Smoothing",
  "description": "",
  "title": "Certified Adversarial Robustness via Randomized Smoothing",
  "collection": "Robustness Methods",
  "area": "General"
}
{
  "name": "RegNetY",
  "full_name": "RegNetY",
  "description": "**RegNetY** is a convolutional network design space with simple, regular models with parameters: depth $d$, initial width $w\\_{0} > 0$, and slope $w\\_{a} > 0$, and generates a different block width $u\\_{j}$ for each block $j < d$. The key restriction for the RegNet types of model is that there is a linear parameterisation of block widths (the design space only contains models with this linear structure):\r\n\r\n$$ u\\_{j} = w\\_{0} + w\\_{a}\\cdot{j} $$\r\n\r\nFor **RegNetX** we have additional restrictions: we set $b = 1$ (the bottleneck ratio), $12 \\leq d \\leq 28$, and $w\\_{m} \\geq 2$ (the width multiplier).\r\n\r\nFor **RegNetY** we make one change, which is to include Squeeze-and-Excitation blocks.",
  "title": "Designing Network Design Spaces",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "RevSilo",
  "full_name": "RevSilo",
  "description": "Invertible multi-input multi-output coupling module. In RevBiFPN it is used as a bidirectional multi-scale feature pyramid fusion module that is invertible.",
  "title": "RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network",
  "collection": "Reversible Image Conversion Models",
  "area": "Computer Vision"
}
{
  "name": "Affine Coupling",
  "full_name": "Affine Coupling",
  "description": "**Affine Coupling** is a method for implementing a normalizing flow (where we stack a sequence of invertible bijective transformation functions). Affine coupling is one of these bijective transformation functions. Specifically, it is an example of a reversible transformation where the forward function, the reverse function and the log-determinant are computationally efficient. For the forward function, we split the input dimension into two parts:\r\n\r\n$$ \\mathbf{x}\\_{a}, \\mathbf{x}\\_{b} = \\text{split}\\left(\\mathbf{x}\\right) $$\r\n\r\nThe second part stays the same $\\mathbf{x}\\_{b} = \\mathbf{y}\\_{b}$, while the first part  $\\mathbf{x}\\_{a}$ undergoes an affine transformation, where the parameters for this transformation are learnt using the second part $\\mathbf{x}\\_{b}$ being put through a neural network. Together we have:\r\n\r\n$$ \\left(\\log{\\mathbf{s}, \\mathbf{t}}\\right) = \\text{NN}\\left(\\mathbf{x}\\_{b}\\right) $$\r\n\r\n$$ \\mathbf{s} = \\exp\\left(\\log{\\mathbf{s}}\\right) $$\r\n\r\n$$ \\mathbf{y}\\_{a} = \\mathbf{s} \\odot \\mathbf{x}\\_{a} + \\mathbf{t}  $$\r\n\r\n$$ \\mathbf{y}\\_{b} = \\mathbf{x}\\_{b} $$\r\n\r\n$$ \\mathbf{y} = \\text{concat}\\left(\\mathbf{y}\\_{a}, \\mathbf{y}\\_{b}\\right) $$\r\n\r\nImage: [GLOW](https://paperswithcode.com/method/glow)",
  "title": "NICE: Non-linear Independent Components Estimation",
  "collection": "Bijective Transformation",
  "area": "General"
}
{
  "name": "ASPP",
  "full_name": "Atrous Spatial Pyramid Pooling",
  "description": "**Atrous Spatial Pyramid Pooling (ASPP)** is a semantic segmentation module for resampling a given feature layer at multiple rates prior to [convolution](https://paperswithcode.com/method/convolution). This amounts to probing the original image with multiple filters that have complementary effective fields of view, thus capturing objects as well as useful image context at multiple scales. Rather than actually resampling features, the mapping is implemented using multiple parallel atrous convolutional layers with different sampling rates.",
  "title": "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs",
  "collection": "Semantic Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "Boom Layer",
  "full_name": "Boom Layer",
  "description": "A **Boom Layer** is a type of feedforward layer that is closely related to the feedforward layers used in Transformers. The layer takes a vector of the form $v \\in \\mathbb{R}^{H}$ and uses a matrix\r\nmultiplication with a GeLU activation to produce a vector $u \\in \\mathbb{R}^{N\\times{H}}$. We then break $u$ into $N$ vectors and sum those together, producing $w \\in \\mathbb{R}^{H}$. This minimizes computation and removes an entire matrix of parameters compared to traditional down-projection layers.\r\n\r\nThe Figure to the right shows the Boom Layer used in the context of [SHA-RNN](https://paperswithcode.com/method/sha-rnn) from the original paper.",
  "title": "Single Headed Attention RNN: Stop Thinking With Your Head",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "LSGAN",
  "full_name": "LSGAN",
  "description": "**LSGAN**, or **Least Squares GAN**, is a type of generative adversarial network that adopts the least squares loss function for the discriminator. Minimizing the objective function of LSGAN yields minimizing the Pearson $\\chi^{2}$ divergence. The objective function can be defined as:\r\n\r\n$$ \\min\\_{D}V\\_{LSGAN}\\left(D\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{x} \\sim p\\_{data}\\left(\\mathbf{x}\\right)}\\left[\\left(D\\left(\\mathbf{x}\\right) - b\\right)^{2}\\right] + \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z}\\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - a\\right)^{2}\\right] $$\r\n\r\n$$ \\min\\_{G}V\\_{LSGAN}\\left(G\\right) = \\frac{1}{2}\\mathbb{E}\\_{\\mathbf{z} \\sim p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\left[\\left(D\\left(G\\left(\\mathbf{z}\\right)\\right) - c\\right)^{2}\\right] $$\r\n\r\nwhere $a$ and $b$ are the labels for fake data and real data and $c$ denotes the value that $G$ wants $D$ to believe for fake data.",
  "title": "Least Squares Generative Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Self-Adjusting Smooth L1 Loss",
  "full_name": "Self-Adjusting Smooth L1 Loss",
  "description": "**Self-Adjusting Smooth L1 Loss** is a loss function used in object detection that was introduced with [RetinaMask](https://paperswithcode.com/method/retinamask). This is an improved version of Smooth L1.  For Smooth L1 loss we have:\r\n\r\n$$ f(x) = 0.5  \\frac{x^{2}}{\\beta} \\text{ if } |x| < \\beta $$\r\n$$ f(x) = |x| -0.5\\beta \\text{ otherwise } $$\r\n\r\nHere a point $\\beta$ splits the positive axis range into two parts: $L2$ loss is used for targets in range $[0, \\beta]$, and $L1$ loss is used beyond $\\beta$ to avoid over-penalizing  utliers. The overall function is smooth (continuous, together with its derivative). However, the choice of control point ($\\beta$) is heuristic and is usually done by hyper parameter search.\r\n\r\nInstead, with self-adjusting smooth L1 loss, inside the loss function the running mean and variance of the absolute loss are recorded. We use the running minibatch mean and variance with a momentum of $0.9$ to update these two parameters.",
  "title": "RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Detr",
  "full_name": "Detection Transformer",
  "description": "**Detr**, or **Detection Transformer**, is a set-based object detector using a [Transformer](https://paperswithcode.com/method/transformer) on top of a convolutional backbone. It uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class and bounding box) or a “no object” class.",
  "title": "End-to-End Object Detection with Transformers",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Firefly algorithm",
  "full_name": "Firefly algorithm",
  "description": "Metaheuristic algorithm",
  "title": "Firefly Algorithm for optimization problems with non-continuous variables: A Review and Analysis",
  "collection": "Heuristic Search Algorithms",
  "area": "Reinforcement Learning"
}
{
  "name": "Residual Normal Distribution",
  "full_name": "Residual Normal Distribution",
  "description": "**Residual Normal Distributions** are used to help the optimization of VAEs, preventing optimization from entering an unstable region. This can happen due to sharp gradients that arise when the encoder and decoder produce distributions far away from each other. The residual distribution parameterizes $q\\left(\\mathbf{z}|\\mathbf{x}\\right)$ relative to $p\\left(\\mathbf{z}\\right)$. Let $p\\left(z^{i}\\_{l}|\\mathbf{z}\\_{<l}\\right) := N \\left(\\mu\\_{i}\\left(\\mathbf{z}\\_{<l}\\right), \\sigma\\_{i}\\left(\\mathbf{z}\\_{<l}\\right)\\right)$ be a Normal distribution for the $i$th variable in $\\mathbf{z}\\_{l}$ in the prior. Define $q\\left(z^{i}\\_{l}|\\mathbf{z}\\_{<l}, \\mathbf{x}\\right) := N\\left(\\mu\\_{i}\\left(\\mathbf{z}\\_{<l}\\right) + \\Delta\\mu\\_{i}\\left(\\mathbf{z}\\_{<l}, \\mathbf{x}\\right), \\sigma\\_{i}\\left(\\mathbf{z}\\_{<l}\\right) \\cdot \\Delta\\sigma\\_{i}\\left(\\mathbf{z}\\_{<l}, \\mathbf{x}\\right) \\right)$, where $\\Delta\\mu\\_{i}\\left(\\mathbf{z}\\_{<l}, \\mathbf{x}\\right)$ and $\\Delta\\sigma\\_{i}\\left(\\mathbf{z}\\_{<l}, \\mathbf{x}\\right)$ are the relative location and scale of the approximate posterior with respect to the prior. With this parameterization, when the prior moves, the approximate posterior moves accordingly, if not changed.",
  "title": "NVAE: A Deep Hierarchical Variational Autoencoder",
  "collection": "Variational Optimization",
  "area": "General"
}
{
  "name": "R1 Regularization",
  "full_name": "R1 Regularization",
  "description": "**$R\\_{1}$ Regularization** is a regularization technique and gradient penalty for training [generative adversarial networks](https://paperswithcode.com/methods/category/generative-adversarial-networks). It penalizes the discriminator for deviating from the Nash Equilibrium by penalizing the gradient on real data alone: when the generator distribution produces the true data distribution and the discriminator is equal to 0 on the data manifold, the gradient penalty ensures that the discriminator cannot create a non-zero gradient orthogonal to the data manifold without suffering a loss in the [GAN](https://paperswithcode.com/method/gan) game.\r\n\r\nThis leads to the following regularization term:\r\n\r\n$$ R\\_{1}\\left(\\psi\\right) = \\frac{\\gamma}{2}E\\_{p\\_{D}\\left(x\\right)}\\left[||\\nabla{D\\_{\\psi}\\left(x\\right)}||^{2}\\right] $$",
  "title": "Which Training Methods for GANs do actually Converge?",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Grammatical evolution + Q-learning",
  "full_name": "Grammatical evolution and Q-learning",
  "description": "This method works as a two-level optimization algorithm.\r\nThe outermost layer uses Grammatical evolution to evolve a grammar to build the agent.\r\nThen, [Q-learning](https://paperswithcode.com/method/q-learning) is used in the fitness evaluation phase to allow the agent to perform online learning.",
  "title": "Evolutionary learning of interpretable decision trees",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "ALBEF",
  "full_name": "ALBEF",
  "description": "ALBEF introduces a contrastive loss to align the image and text representations before fusing them through cross-modal attention. This enables more grounded vision and language representation learning. ALBEF also doesn't require bounding box annotations. The model consists of an image encoder, a text encoder, and a multimodal encoder. The image-text contrastive loss helps to align the unimodal representations of an image-text pair before fusion. The image-text matching loss and a masked language modeling loss are applied to learn multimodal interactions between image and text. In addition, momentum distillation is used to generate pseudo-targets. This improves learning with noisy data.",
  "title": "Align before Fuse: Vision and Language Representation Learning with Momentum Distillation",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "MPCK-Means",
  "full_name": "Metric Pairwise Constrained KMeans",
  "description": "Original paper : Integrating Constraints and Metric Learning in Semi-Supervised Clustering, Bilenko et al. 2004",
  "title": null,
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "Go-Explore",
  "full_name": "Go-Explore",
  "description": "**Go-Explore** is a family of algorithms aiming to tackle two challenges with effective exploration in reinforcement learning: algorithms forgetting how to reach previously visited states (\"detachment\") and failing to first return to a state before exploring from it (\"derailment\").\r\n\r\nTo avoid detachment, Go-Explore builds an archive of the different states it has visited in the environment, thus ensuring that states cannot be forgotten. Starting with an archive containing only the initial state, the archive is built iteratively. In Go-Explore we:\r\n\r\n(a) Probabilistically select a state from the archive, preferring states associated with promising cells. \r\n\r\n(b) Return to the selected state, such as by restoring simulator state or by running a goal-conditioned policy. \r\n\r\n(c) Explore from that state by taking random actions or sampling from a trained policy. \r\n\r\n(d) Map every state encountered during returning and exploring to a low-dimensional cell representation. \r\n\r\n(e) Add states that map to new cells to the archive and update other archive entries.",
  "title": "Go-Explore: a New Approach for Hard-Exploration Problems",
  "collection": "Behaviour Policies",
  "area": "Reinforcement Learning"
}
{
  "name": "ViLBERT",
  "full_name": "Vision-and-Language BERT",
  "description": "**Vision-and-Language BERT** (**ViLBERT**) is a [BERT](https://paperswithcode.com/method/bert)-based model for learning task-agnostic joint representations of image content and natural language. ViLBERT extends the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional [transformer](https://paperswithcode.com/method/transformer) layers.",
  "title": "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks",
  "collection": "Representation Learning",
  "area": "General"
}
{
  "name": "BoundaryNet",
  "full_name": "BoundaryNet",
  "description": "**BoundaryNet** is a resizing-free approach for layout annotation. The variable-sized, user-selected region of interest is first processed by an attention-guided skip network. The network optimization is guided via Fast Marching distance maps to obtain a good quality initial boundary estimate and an associated feature representation. These outputs are processed by a Residual Graph [Convolution](https://paperswithcode.com/method/convolution) Network optimized using Hausdorff loss to obtain the final region boundary.",
  "title": "BoundaryNet: An Attentive Deep Network with Fast Marching Distance Maps for Semi-automatic Layout Annotation",
  "collection": "Layout Annotation Models",
  "area": "Computer Vision"
}
{
  "name": "GLN",
  "full_name": "Gated Linear Network",
  "description": "A **Gated Linear Network**, or **GLN**, is a type of backpropagation-free neural architecture. What distinguishes GLNs from contemporary neural networks is the distributed and local nature of their credit assignment mechanism; each neuron directly predicts the target, forgoing the ability to learn feature representations in favor of rapid online learning. Individual neurons can model nonlinear functions via the use of data-dependent gating in conjunction with online convex optimization. \r\n\r\nGLNs are feedforward networks composed of many layers of gated geometric mixing neurons, as shown in the Figure. Each neuron in a given layer outputs a gated geometric mixture of the predictions from the previous layer, with the final layer consisting of just a single neuron. In a supervised learning setting, a $\\mathrm{GLN}$ is trained on (side information, base predictions, label) triplets $\\left(z\\_{t}, p\\_{t}, x\\_{t}\\right)_{t=1,2,3, \\ldots}$ derived from input-label pairs $\\left(z\\_{t}, x\\_{t}\\right)$. There are two types of input to neurons in the network: the first is the side information $z\\_{t}$, which can be thought of as the input features; the second is the input to the neuron, which will be the predictions output by the previous layer, or in the case of layer 0, some (optionally) provided base predictions $p\\_{t}$ that typically will be a function of $z\\_{t}$. Each neuron will also take in a constant bias prediction, which helps empirically and is essential for universality guarantees.\r\n\r\nWeights are learnt in a Gated Linear Network using Online Gradient Descent (OGD) locally at each neuron. The key observation is that as each neuron $(i, k)$ in layers $i>0$ is itself a gated geometric mixture, all of these neurons can be thought of as individually predicting the target. 
Given side information $z$, each neuron $(i, k)$ suffers a loss convex in its active weights $u:=w\\_{i k c\\_{i k}(z)}$ of\r\n$$\r\n\\ell\\_{t}(u):=-\\log \\left(\\operatorname{GEO}\\_{u}\\left(x_{t} ; p\\_{i-1}\\right)\\right)\r\n$$",
  "title": "Gated Linear Networks",
  "collection": "Gated Linear Networks",
  "area": "General"
}
{
  "name": "WordPiece",
  "full_name": "WordPiece",
  "description": "**WordPiece** is a subword segmentation algorithm used in natural language processing. The vocabulary is initialized with individual characters in the language, then the most frequent combinations of symbols in the vocabulary are iteratively added to the vocabulary. The process is:\r\n\r\n1. Initialize the word unit inventory with all the characters in the text.\r\n2. Build a language model on the training data using the inventory from 1.\r\n3. Generate a new word unit by combining two units out of the current word inventory to increment the word unit inventory by one. Choose the new word unit out of all the possible ones that increases the likelihood on the training data the most when added to the model.\r\n4. Go to step 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.\r\n\r\nText: [Source](https://stackoverflow.com/questions/55382596/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble/55416944#55416944)\r\n\r\nImage: WordPiece as used in [BERT](https://paperswithcode.com/method/bert)",
  "title": "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation",
  "collection": "Subword Segmentation",
  "area": "Natural Language Processing"
}
{
  "name": "Gradient Sparsification",
  "full_name": "Gradient Sparsification",
  "description": "**Gradient Sparsification** is a technique for distributed training that sparsifies stochastic gradients to reduce the communication cost, with only a minor increase in the number of iterations. The key idea behind the sparsification technique is to drop some coordinates of the stochastic gradient and appropriately amplify the remaining coordinates to ensure the unbiasedness of the sparsified stochastic gradient. The sparsification approach can significantly reduce the coding length of the stochastic gradient and only slightly increase the variance of the stochastic gradient.",
  "title": "Gradient Sparsification for Communication-Efficient Distributed Optimization",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "PanGu-$α$",
  "full_name": "PanGu-$α$",
  "description": "**PanGu-$α$** is an autoregressive language model (ALM) with up to 200 billion parameters pretrained on a large corpus of text, mostly in Chinese. The architecture of PanGu-$α$ is based on Transformer, which has been extensively used as the backbone of a variety of pretrained language models such as [BERT](https://paperswithcode.com/method/bert) and [GPT](https://paperswithcode.com/method/gpt). Unlike them, it has an additional query layer on top of the Transformer layers, which aims to explicitly induce the expected output.",
  "title": "PanGu-$α$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "LAMA",
  "full_name": "Low-Rank Factorization-based Multi-Head Attention",
  "description": "**Low-Rank Factorization-based Multi-head Attention Mechanism**, or **LAMA**, is a type of attention module that uses low-rank factorization to reduce computational complexity. It uses low-rank bilinear pooling to construct a structured sentence representation that attends to multiple aspects of a sentence.",
  "title": "Low Rank Factorization for Compact Multi-Head Self-Attention",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "CBHG",
  "full_name": "CBHG",
  "description": "**CBHG** is a building block used in the [Tacotron](https://paperswithcode.com/method/tacotron) text-to-speech model. It consists of a bank of 1-D convolutional filters, followed by highway networks and a bidirectional gated recurrent unit ([BiGRU](https://paperswithcode.com/method/bigru)). \r\n\r\nThe module is used to extract representations from sequences. The input sequence is first convolved with $K$ sets of 1-D convolutional filters, where the $k$-th set contains $C\\_{k}$ filters of width $k$ (i.e. $k = 1, 2, \\dots , K$). These filters explicitly model local and contextual information (akin to modeling unigrams, bigrams, up to K-grams). The [convolution](https://paperswithcode.com/method/convolution) outputs are stacked together and further max pooled along time to increase local invariances. A stride of 1 is used to preserve the original time resolution. The processed sequence is further passed to a few fixed-width 1-D convolutions, whose outputs are added to the original input sequence via residual connections. [Batch normalization](https://paperswithcode.com/method/batch-normalization) is used for all convolutional layers. The convolution outputs are fed into a multi-layer [highway network](https://paperswithcode.com/method/highway-network) to extract high-level features. Finally, a bidirectional [GRU](https://paperswithcode.com/method/gru) RNN is stacked on top to extract sequential features from both forward and backward context.",
  "title": "Tacotron: Towards End-to-End Speech Synthesis",
  "collection": "Speech Synthesis Blocks",
  "area": "Audio"
}
{
  "name": "GAN-TTS",
  "full_name": "GAN-TTS",
  "description": "**GAN-TTS** is a generative adversarial network for text-to-speech synthesis. The architecture is composed of a conditional feed-forward generator producing raw speech audio, and an ensemble of discriminators which operate on random windows of different sizes. The discriminators analyze the audio both in terms of general realism, as well as how well the audio corresponds to the utterance that should be pronounced.\r\n\r\nThe generator architecture consists of several GBlocks, which are residual based (dilated) [convolution](https://paperswithcode.com/method/convolution) blocks. GBlocks 3–7 gradually upsample the temporal dimension of hidden representations by factors of 2, 2, 2, 3, 5, while the number of channels is reduced by GBlocks 3, 6 and 7 (by a factor of 2 each). The final convolutional layer with [Tanh activation](https://paperswithcode.com/method/tanh-activation) produces a single-channel audio waveform.\r\n\r\nInstead of a single discriminator, GAN-TTS uses an ensemble of Random Window Discriminators (RWDs) which operate on randomly sub-sampled fragments of the real or generated samples. The ensemble allows for the evaluation of audio in different complementary ways.",
  "title": "High Fidelity Speech Synthesis with Adversarial Networks",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "Neural adjoint",
  "full_name": "Neural adjoint method",
  "description": "The NA method can be divided into two steps: (i) training a neural network approximation of $f$, and (ii) inference of $\\hat{x}$. Step (i) is conventional and involves training a generic neural network on a dataset of input/output pairs from the simulator, denoted $D$, resulting in $\\hat{f}$, an approximation of the forward model. This is illustrated in the left inset of Fig. 1. In step (ii), our goal is to use $\\partial \\hat{f}/\\partial x$ to help us gradually adjust $x$ so that we achieve a desired output of the forward model, $y$. This is similar to many classical inverse modeling approaches, such as the popular Adjoint method [8, 9]. For many practical inverse problems, however, obtaining $\\partial f/\\partial x$ requires significant expertise and/or effort, making these approaches challenging. Crucially, $\\hat{f}$ from step (i) provides us with a closed-form differentiable expression for the simulator, from which it is trivial to compute $\\partial \\hat{f}/\\partial x$, and furthermore, we can use modern deep learning software packages to efficiently estimate gradients, given a loss function $L$.\r\n\r\nMore formally, let $y$ be our target output, and let $\\hat{x}\\_{i}$ be our current estimate of the solution, where $i$ indexes each solution we obtain in an iterative gradient-based estimation procedure. Then we compute $\\hat{x}\\_{i+1}$ from $\\hat{x}\\_{i}$ with a gradient-based update.",
  "title": "Benchmarking deep inverse models over time, and the neural-adjoint method",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "BiDet",
  "full_name": "BiDet",
  "description": "**BiDet** is a binarized neural network learning method for efficient object detection. Conventional network binarization methods directly quantize the weights and activations in one-stage or two-stage detectors with constrained representational capacity, so that the information redundancy in the networks causes numerous false positives and degrades the performance significantly. On the contrary, BiDet fully utilizes the representational capacity of the binary neural networks for object detection by redundancy removal, through which the detection precision is enhanced with alleviated false positives. Specifically, the information bottleneck (IB) principle is generalized to object detection, where the amount of information in the high-level feature maps is constrained and the mutual information between the feature maps and object detection is maximized.",
  "title": "BiDet: An Efficient Binarized Object Detector",
  "collection": "Binary Neural Networks",
  "area": "General"
}
{
  "name": "Relative Position Encodings",
  "full_name": "Relative Position Encodings",
  "description": "**Relative Position Encodings** are a type of position embedding for [Transformer-based models](https://paperswithcode.com/methods/category/transformers) that attempt to exploit pairwise, relative positional information. Relative positional information is supplied to the model on two levels: values and keys. This becomes apparent in the two modified self-attention equations shown below. First, relative positional information is supplied to the model as an additional component to the keys\r\n\r\n$$ e\\_{ij} = \\frac{x\\_{i}W^{Q}\\left(x\\_{j}W^{K} + a^{K}\\_{ij}\\right)^{T}}{\\sqrt{d\\_{z}}} $$\r\n\r\nHere $a$ is an edge representation for the inputs $x\\_{i}$ and $x\\_{j}$. The [softmax](https://paperswithcode.com/method/softmax) operation remains unchanged from vanilla self-attention. Then relative positional information is supplied again as a sub-component of the values matrix:\r\n\r\n$$ z\\_{i} = \\sum^{n}\\_{j=1}\\alpha\\_{ij}\\left(x\\_{j}W^{V} + a\\_{ij}^{V}\\right)$$\r\n\r\nIn other words, instead of simply combining semantic embeddings with absolute positional ones, relative positional information is added to keys and values on the fly during attention calculation.\r\n\r\nSource: [Jake Tae](https://jaketae.github.io/study/relative-positional-encoding/)\r\n\r\nImage Source: [Relative Positional Encoding for Transformers with Linear Complexity](https://www.youtube.com/watch?v=qajudaEHuq8)",
  "title": "Self-Attention with Relative Position Representations",
  "collection": "Position Embeddings",
  "area": "General"
}
{
  "name": "CodeBERT",
  "full_name": "CodeBERT",
  "description": "**CodeBERT** is a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. CodeBERT is developed with a [Transformer](https://paperswithcode.com/method/transformer)-based neural architecture, and is trained with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables the utilization of both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators.",
  "title": "CodeBERT: A Pre-Trained Model for Programming and Natural Languages",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Peer-attention",
  "full_name": "Peer-attention",
  "description": "**Peer-attention** is a network component which dynamically learns the attention weights using another block or input modality. This is unlike AssembleNet which partially relies on exponential mutations to explore connections. Once the attention weights are found, we can either prune the connections by only leaving the argmax over $h$ or leave them with [softmax](https://paperswithcode.com/method/softmax).",
  "title": "AssembleNet++: Assembling Modality Representations via Attention Connections",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "Strided Attention",
  "full_name": "Strided Attention",
  "description": "**Strided Attention** is a factorized attention pattern that has one head attend to the previous $l$ locations, and the other head attend to every $l$th location, where $l$ is the stride and chosen to be close to $\\sqrt{n}$. It was proposed as part of the [Sparse Transformer](https://paperswithcode.com/method/sparse-transformer) architecture.\r\n\r\nA self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \\text{set}\\left(S\\_{1}, \\dots, S\\_{n}\\right)$, where $S\\_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:\r\n\r\n$$ \\text{Attend}\\left(X, S\\right) = \\left(a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right)\\right)\\_{i\\in\\text{set}\\left(1,\\dots,n\\right)}$$\r\n\r\n$$ a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right) = \\text{softmax}\\left(\\frac{\\left(W\\_{q}\\mathbf{x}\\_{i}\\right)K^{T}\\_{S\\_{i}}}{\\sqrt{d}}\\right)V\\_{S\\_{i}} $$\r\n\r\n$$ K\\_{Si} = \\left(W\\_{k}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\n$$ V\\_{Si} = \\left(W\\_{v}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\nHere $W\\_{q}$, $W\\_{k}$, and $W\\_{v}$ represent the weight matrices which transform a given $x\\_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.\r\n\r\nFull self-attention for autoregressive models defines $S\\_{i} = \\text{set}\\left(j : j \\leq i\\right)$, allowing every element to attend to all previous positions and its own position.\r\n\r\nFactorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A\\_{i}^{(m)} ⊂ \\text{set}\\left(j : j \\leq i\\right)$ and lets $S\\_{i} = A\\_{i}^{(m)}$. 
The goal with the Sparse [Transformer](https://paperswithcode.com/method/transformer) was to find efficient choices for the subset $A$.\r\n\r\nFormally for Strided Attention, $A^{(1)}\\_{i} = ${$t, t + 1, ..., i$} for $t = \\max\\left(0, i − l\\right)$, and $A^{(2)}\\_{i} = ${$j : (i − j) \\mod l = 0$}. The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}\\_{i}$ or $A^{(2)}\\_{i}$. This pattern can be visualized in the figure to the right.\r\n\r\nThis formulation is convenient if the data naturally has a structure that aligns with the stride, like images or some types of music. For data without a periodic structure, like text, however, the authors find that the network can fail to properly route information with the strided pattern, as spatial coordinates for an element do not necessarily correlate with the positions where the element may be most relevant in the future.",
  "title": "Generating Long Sequences with Sparse Transformers",
  "collection": "Attention Patterns",
  "area": "Natural Language Processing"
}
{
  "name": "PoolFormer",
  "full_name": "PoolFormer",
  "description": "PoolFormer is instantiated from MetaFormer by specifying the token mixer as an extremely simple operator, pooling. PoolFormer is used as a tool to verify the MetaFormer hypothesis \"MetaFormer is actually what you need\" (vs \"Attention is all you need\").",
  "title": "MetaFormer Is Actually What You Need for Vision",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "RoI Tanh-polar Transform",
  "full_name": "RoI Tanh-polar Transform",
  "description": "",
  "title": "RoI Tanh-polar Transformer Network for Face Parsing in the Wild",
  "collection": "Image Representations",
  "area": "Computer Vision"
}
{
  "name": "Off-Diagonal Orthogonal Regularization",
  "full_name": "Off-Diagonal Orthogonal Regularization",
  "description": "**Off-Diagonal Orthogonal Regularization** is a modified form of [orthogonal regularization](https://paperswithcode.com/method/orthogonal-regularization) originally used in [BigGAN](https://paperswithcode.com/method/biggan). The original orthogonal regularization is known to be limiting so the authors explore several variants designed to relax the constraint while still imparting the desired smoothness to the models. They opt for a modification that removes the diagonal terms from the regularization; it aims to minimize the pairwise cosine similarity between filters but does not constrain their norm:\r\n\r\n$$ R\\_{\\beta}\\left(W\\right) = \\beta|| W^{T}W \\odot \\left(\\mathbf{1}-I\\right) ||^{2}\\_{F} $$\r\n\r\nwhere $\\mathbf{1}$ denotes a matrix with all elements set to 1. The authors sweep $\\beta$ values and select $10^{-4}$.",
  "title": "Large Scale GAN Training for High Fidelity Natural Image Synthesis",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "CornerNet-Saccade",
  "full_name": "CornerNet-Saccade",
  "description": "**CornerNet-Saccade** is an extension of [CornerNet](https://paperswithcode.com/method/cornernet) with an attention mechanism similar to saccades in human vision. It starts with a downsized full image and generates an attention map, which is then zoomed in on and processed further by the model. This differs from the original CornerNet in that it is applied fully convolutionally across multiple scales.",
  "title": "CornerNet-Lite: Efficient Keypoint Based Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "TridentNet",
  "full_name": "TridentNet",
  "description": "**TridentNet** is an object detection architecture that aims to generate scale-specific feature maps with a uniform representational power. A parallel multi-branch architecture is constructed in which each branch shares the same transformation parameters but with different receptive fields. A scale-aware training scheme is used to specialize each branch by sampling object instances of proper scales for training.",
  "title": "Scale-Aware Trident Networks for Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "MFF",
  "full_name": "Multimodal Fuzzy Fusion Framework",
  "description": "BCI MI signal Classification Framework using Fuzzy integrals.\r\n\r\nPaper: Ko, L. W., Lu, Y. C., Bustince, H., Chang, Y. C., Chang, Y., Ferandez, J., ... & Lin, C. T. (2019). Multimodal fuzzy fusion for enhancing the motor-imagery-based brain computer interface. IEEE Computational Intelligence Magazine, 14(1), 96-106.",
  "title": "Choquet integral in decision analysis - lessons from the axiomatization",
  "collection": "Non-Parametric Classification",
  "area": "General"
}
{
  "name": "CGRU",
  "full_name": "Convolutional GRU",
  "description": "A **Convolutional Gated Recurrent Unit** is a type of [GRU](https://paperswithcode.com/method/gru) that combines GRUs with the [convolution](https://paperswithcode.com/method/convolution) operation. The update rule for input $x\\_{t}$ and the previous output $h\\_{t-1}$ is given by the following:\r\n\r\n$$ r = \\sigma\\left(W\\_{r} \\star\\_{n}\\left[h\\_{t-1};x\\_{t}\\right] + b\\_{r}\\right) $$\r\n\r\n$$ u = \\sigma\\left(W\\_{u} \\star\\_{n}\\left[h\\_{t-1};x\\_{t}\\right] + b\\_{u} \\right) $$\r\n\r\n$$ c = \\rho\\left(W\\_{c} \\star\\_{n}\\left[x\\_{t}; r \\odot h\\_{t-1}\\right] + b\\_{c} \\right) $$\r\n\r\n$$ h\\_{t} = u \\odot h\\_{t-1} + \\left(1-u\\right) \\odot c $$\r\n\r\nIn these equations $\\sigma$ and $\\rho$ are the elementwise sigmoid and [ReLU](https://paperswithcode.com/method/relu) functions respectively and the $\\star\\_{n}$ represents a convolution with a kernel of size $n \\times n$. Brackets are used to represent a feature concatenation.",
  "title": "Delving Deeper into Convolutional Networks for Learning Video Representations",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "DPT",
  "full_name": "Dense Prediction Transformer",
  "description": "**Dense Prediction Transformers** (DPT) are a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) for dense prediction tasks.\r\n\r\nThe input image is transformed into tokens (orange) either by extracting non-overlapping patches followed by a linear projection of their flattened representation (DPT-Base and DPT-Large) or by applying a [ResNet](https://paperswithcode.com/method/resnet)-50 feature extractor (DPT-Hybrid). The image embedding is augmented with a positional embedding and a patch-independent readout token (red) is added. The tokens are passed through multiple [transformer](https://paperswithcode.com/method/transformer) stages. The tokens are reassembled from different stages into an image-like representation at multiple resolutions (green). Fusion modules (purple) progressively fuse and upsample the representations to generate a fine-grained prediction.",
  "title": "Vision Transformers for Dense Prediction",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "ALIGN",
  "full_name": "ALIGN",
  "description": "In the ALIGN method, visual and language representations are jointly trained from noisy image alt-text data. The image and text encoders are learned via contrastive loss (formulated as normalized softmax) that pushes the embeddings of the matched image-text pair together and pushing those of non-matched image-text pair apart. The model learns to align visual and language representations of the image and text pairs using the contrastive loss. The representations can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers zero-shot visual classification and cross-modal search including image-to-text search, text-to image search and even search with joint image+text queries.",
  "title": "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "LINE",
  "full_name": "Large-scale Information Network Embedding",
  "description": "LINE is a novel network embedding method which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures.\r\n\r\nSource: [Tang et al.](https://arxiv.org/pdf/1503.03578v1.pdf)\r\n\r\nImage source: [Tang et al.](https://arxiv.org/pdf/1503.03578v1.pdf)",
  "title": "LINE: Large-scale Information Network Embedding",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Unigram Segmentation",
  "full_name": "Unigram Segmentation",
  "description": "**Unigram Segmentation** is a subword segmentation algorithm based on a unigram language model. It provides multiple segmentations with probabilities. The language model allows for emulating the noise generated during the segmentation of actual data.\r\n\r\nThe unigram language model makes an assumption that each subword occurs independently, and consequently, the probability of a subword sequence $\\mathbf{x} = (x_1,\\ldots,x_M)$ is\r\nformulated as the product of the subword occurrence probabilities\r\n$p(x_i)$:\r\n\r\n$$\r\n  P(\\mathbf{x}) = \\prod_{i=1}^{M} p(x_i), \\\\\\\\\r\n  \\forall i\\,\\, x_i \\in \\mathcal{V},\\,\\,\\,\r\n  \\sum_{x \\in \\mathcal{V}} p(x) = 1, \\nonumber\r\n$$\r\n\r\nwhere $\\mathcal{V}$ is a pre-determined vocabulary.  The most probable\r\nsegmentation $\\mathbf{x}^*$ for the input sentence $X$ is then given by:\r\n\r\n$$\r\n  \\mathbf{x}^{*} = \\text{argmax}_{\\mathbf{x} \\in \\mathcal{S}(X)} P(\\mathbf{x}),\r\n$$\r\n\r\nwhere $\\mathcal{S}(X)$ is a set of segmentation candidates built from\r\nthe input sentence $X$.  $\\mathbf{x}^*$ is obtained with the Viterbi\r\nalgorithm.",
  "title": "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates",
  "collection": "Subword Segmentation",
  "area": "Natural Language Processing"
}
{
  "name": "Efficient Channel Attention",
  "full_name": "Efficient Channel Attention",
  "description": "**Efficient Channel Attention** is an architectural unit based on [squeeze-and-excitation](https://paperswithcode.com/method/squeeze-and-excitation-block) blocks that reduces model complexity without dimensionality reduction. It was proposed as part of the [ECA-Net](https://paperswithcode.com/method/eca-net) CNN architecture. \r\n\r\nAfter channel-wise [global average pooling](https://paperswithcode.com/method/global-average-pooling) without dimensionality reduction, the ECA captures local cross-channel interaction by considering every channel and its $k$ neighbors. The ECA can be efficiently implemented by fast $1D$ [convolution](https://paperswithcode.com/method/convolution) of size $k$, where kernel size $k$ represents the coverage of local cross-channel interaction, i.e., how many neighbors participate in attention prediction of one channel.",
  "title": "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Linear Layer",
  "full_name": "Linear Layer",
  "description": "A **Linear Layer** is a projection $\\mathbf{XW + b}$.",
  "title": null,
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "LocalViT",
  "full_name": "LocalViT",
  "description": "**LocalViT** aims to introduce depthwise convolutions to enhance local features modeling capability of ViTs. The network, as shown in Figure (c), brings localist mechanism into transformers through the depth-wise convolution (denoted by \"DW\"). To cope with the convolution operation, the conversation between sequence and image feature map is added by \"Seq2Img\" and \"Img2Seq\". The computation is as follows:\r\n\r\n$$\r\n\\mathbf{Y}^{r}=f\\left(f\\left(\\mathbf{Z}^{r} \\circledast \\mathbf{W}_{1}^{r} \\right) \\circledast \\mathbf{W}_d  \\right) \\circledast \\mathbf{W}_2^{r}\r\n$$\r\n\r\nwhere $\\mathbf{W}_{d} \\in \\mathbb{R}^{\\gamma d \\times 1 \\times k \\times k}$ is the kernel of the depth-wise convolution.\r\n\r\nThe input (sequence of tokens) is first reshaped to a feature map rearranged on a 2D lattice. Two convolutions along with a depth-wise convolution are applied to the feature map. The feature map is reshaped to a sequence of tokens which are used as by the self-attention of the network transformer layer.",
  "title": "LocalViT: Bringing Locality to Vision Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "BERT",
  "full_name": "BERT",
  "description": "**BERT**, or Bidirectional Encoder Representations from Transformers, improves upon standard [Transformers](http://paperswithcode.com/method/transformer) by removing the unidirectionality constraint by using a *masked language model* (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, BERT uses a *next sentence prediction* task that jointly pre-trains text-pair representations. \r\n\r\nThere are two steps in BERT: *pre-training* and *fine-tuning*. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they\r\nare initialized with the same pre-trained parameters.",
  "title": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Prioritized Experience Replay",
  "full_name": "Prioritized Experience Replay",
  "description": "**Prioritized Experience Replay** is a type of [experience replay](https://paperswithcode.com/method/experience-replay) in reinforcement learning where we more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. This prioritization can lead to a loss of diversity, which is alleviated with stochastic prioritization, and introduce bias, which can be corrected with importance sampling.\r\n\r\nThe stochastic sampling method interpolates between pure greedy prioritization and uniform random sampling. The probability of being sampled is ensured to be monotonic in a transition's priority,  while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, define the probability of sampling transition $i$ as\r\n\r\n$$P(i) = \\frac{p_i^{\\alpha}}{\\sum_k p_k^{\\alpha}}$$\r\n\r\nwhere $p_i > 0$ is the priority of transition $i$. The exponent $\\alpha$ determines how much prioritization is used, with $\\alpha=0$ corresponding to the uniform case.\r\n\r\nPrioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to. We can correct this bias by using\r\nimportance-sampling (IS) weights:\r\n\r\n$$ w\\_{i} = \\left(\\frac{1}{N}\\cdot\\frac{1}{P\\left(i\\right)}\\right)^{\\beta} $$\r\n\r\nthat fully compensates for the non-uniform probabilities $P\\left(i\\right)$ if $\\beta = 1$. These weights can be folded into the [Q-learning](https://paperswithcode.com/method/q-learning) update by using $w\\_{i}\\delta\\_{i}$ instead of $\\delta\\_{i}$ - weighted IS rather than ordinary IS. 
For stability reasons, we always normalize weights by $1/\\max\\_{i}w\\_{i}$ so\r\nthat they only scale the update downwards.\r\n\r\nThe two types of prioritization are proportional based, where $p\\_{i} = |\\delta\\_{i}| + \\epsilon$ and rank-based, where $p\\_{i} = \\frac{1}{\\text{rank}\\left(i\\right)}$, the latter where $\\text{rank}\\left(i\\right)$ is the rank of transition $i$ when the replay memory is sorted according to |$\\delta\\_{i}$|, For proportional based, hyperparameters used were $\\alpha = 0.7$, $\\beta\\_{0} = 0.5$. For the rank-based variant, hyperparameters used were $\\alpha = 0.6$, $\\beta\\_{0} = 0.4$.",
  "title": "Prioritized Experience Replay",
  "collection": "Replay Memory",
  "area": "Reinforcement Learning"
}
{
  "name": "DistilBERT",
  "full_name": "DistilBERT",
  "description": "**DistilBERT**  is a small, fast, cheap and light [Transformer](https://paperswithcode.com/method/transformer) model based on the [BERT](https://paperswithcode.com/method/bert) architecture. Knowledge distillation is performed during the pre-training phase to reduce the size of a BERT model by 40%. To leverage the inductive biases learned by larger models during pre-training, the authors introduce a triple loss combining language modeling, distillation and cosine-distance losses.",
  "title": "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "FLIP",
  "full_name": "FLIP",
  "description": "https://developer.nvidia.com/blog/flip-a-difference-evaluator-for-alternating-images/",
  "title": "FLIP: A Difference Evaluator for Alternating Images",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "RFB Net",
  "full_name": "RFB Net",
  "description": "**RFB Net** is a one-stage object detector that utilises a receptive field block module. It utilises a VGG16 backbone, and is otherwise quite similar to the [SSD](https://paperswithcode.com/method/ssd) architecture.",
  "title": "Receptive Field Block Net for Accurate and Fast Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Bi3D",
  "full_name": "Bi3D",
  "description": "**Bi3D** is a stereo depth estimation framework that estimates depth via a series of binary classifications. Rather than testing if objects are at a particular depth *D*, as existing stereo methods do, it classifies them as being closer or farther than *D*. It takes the stereo pair and a disparity $d\\_{i}$ and produces a confidence map, which can be thresholded to yield the binary segmentation. To estimate depth on $N + 1$ quantization levels we run this network $N$ times and maximize the probability in Equation 8 (see paper). To estimate continuous depth, whether full or selective, we run the [SegNet](https://paperswithcode.com/method/segnet) block of Bi3DNet for each disparity level and work directly on the confidence volume.",
  "title": "Bi3D: Stereo Depth Estimation via Binary Classifications",
  "collection": "Stereo Depth Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "NLSIG",
  "full_name": "nlogistic-sigmoid function",
  "description": "Nlogistic-sigmoid function (NLSIG) is a modern logistic-sigmoid function definition for modelling growth (or decay) processes. It features two logistic metrics (YIR and XIR) for monitoring growth from a two-dimensional (x-y axis) perspective.",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Cross-Attention Module",
  "full_name": "Cross-Attention Module",
  "description": "The **Cross-Attention** module is an attention module used in [CrossViT](https://paperswithcode.com/method/crossvit) for fusion of multi-scale features. The CLS token of the large branch (circle) serves as a query token to interact with the patch tokens from the small branch through attention. $f\\left(·\\right)$ and $g\\left(·\\right)$ are projections to align dimensions. The small branch follows the same procedure but swaps CLS and patch tokens from another branch.",
  "title": "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "NetMF",
  "full_name": "Network Embedding as Matrix Factorization:",
  "description": "",
  "title": "Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "DeepCluster",
  "full_name": "DeepCluster",
  "description": "**DeepCluster** is a self-supervision approach for learning image representations.  DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update\r\nthe weights of the network",
  "title": "Deep Clustering for Unsupervised Learning of Visual Features",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Meta-augmentation",
  "full_name": "Meta-augmentation",
  "description": "**Meta-augmentation** helps generate more varied tasks for a single example in meta-learning. It can be distinguished from data augmentation in classic machine learning as follows. For data augmentation in classical machine learning, the aim is to generate more varied examples, within a single task. Meta-augmentation has the exact opposite aim: we wish to generate more varied tasks,\r\nfor a single example, to force the learner to quickly learn a new task from feedback. In meta-augmentation, adding randomness discourages the base learner and model from learning trivial solutions that do not generalize to new tasks.",
  "title": "Meta-Learning Requires Meta-Augmentation",
  "collection": "Meta-Learning Algorithms",
  "area": "General"
}
{
  "name": "AutoDropout",
  "full_name": "AutoDropout",
  "description": "**AutoDropout** automates the process of designing [dropout](https://paperswithcode.com/method/dropout) patterns using a [Transformer](https://paperswithcode.com/method/transformer) based controller. In this method, a controller learns to generate a dropout pattern at every channel and layer of a target network, such as a [ConvNet](https://paperswithcode.com/methods/category/convolutional-neural-networks) or a Transformer. The target network is then trained with the dropped-out pattern, and its resulting validation performance is used as a signal for the controller to learn from. The resulting pattern is applied to a convolutional output channel, which is a common building block of image recognition models.\r\n\r\nThe controller network generates the tokens to describe the configurations of the dropout pattern. The tokens are generated like words in a language model. For every layer in a ConvNet, a group of 8 tokens need to be made to create a dropout pattern. These 8 tokens are generated sequentially. In the figure above, size, stride, and repeat indicate the size and the tiling of the pattern; rotate, shear_x, and shear_y specify the geometric transformations of the pattern; share_c is a binary deciding whether a pattern is applied to all $C$ channels; and residual is a binary deciding whether the pattern is applied to the residual branch as well. If we need $L$ dropout patterns, the controller will generate $8L$ decisions.",
  "title": "AutoDropout: Learning Dropout Patterns to Regularize Deep Networks",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Polya-Gamma Augmentation",
  "full_name": "Data augmentation using Polya-Gamma latent variables.",
  "description": "This method applies Polya-Gamma latent variables as a way to obtain closed form expressions for full-conditionals of posterior distributions in sampling algorithms like MCMC.",
  "title": "Bayesian inference for logistic models using Polya-Gamma latent variables",
  "collection": "Latent Variable Sampling",
  "area": "General"
}
{
  "name": "NEAT",
  "full_name": "Neural Attention Fields",
  "description": "**NEAT**, or **Neural Attention Fields**, is a feature representation for end-to-end imitation learning models. NEAT is a continuous function which maps locations in Bird's Eye View (BEV) scene coordinates to waypoints and semantics, using intermediate attention maps to iteratively compress high-dimensional 2D image features into a compact representation. This allows the model to selectively attend to relevant regions in the input while ignoring information irrelevant to the driving task, effectively associating the images with the BEV representation. Furthermore, visualizing the attention maps for models with NEAT intermediate representations provides improved interpretability.",
  "title": "NEAT: Neural Attention Fields for End-to-End Autonomous Driving",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "FMix",
  "full_name": "FMix",
  "description": "A variant of [CutMix](https://paperswithcode.com/method/cutmix) which randomly samples masks from Fourier space.",
  "title": "FMix: Enhancing Mixed Sample Data Augmentation",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "TS",
  "full_name": "Spatio-temporal stability analysis",
  "description": "Spatio-temporal features extraction that measure the stabilty. The proposed method is based on a compression algorithm named Run Length Encoding. The workflow of the method is presented bellow.",
  "title": null,
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "ByteScheduler",
  "full_name": "ByteScheduler",
  "description": "**ByteScheduler** is a generic communication scheduler for distributed DNN training acceleration. It is based on analysis that partitioning and rearranging the tensor transmissions can result in optimal results in theory and good performance in real-world even with scheduling overhead.",
  "title": null,
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "Parallax",
  "full_name": "Parallax",
  "description": "**Parallax** is a hybrid parallel method for training large neural networks. Parallax is a framework that optimizes data parallel training by utilizing the sparsity of model parameters. Parallax introduces a hybrid approach that combines Parameter Server and AllReduce architectures to optimize the amount of data transfer according to the sparsity.\r\n\r\nParallax pursues a hybrid approach that uses the Parameter Server architecture for handling sparse variables and the AllReduce architecture for handling dense variables. Moreover, Parallax partitions large sparse variables by a near-optimal number of partitions to maximize parallelism while maintaining low computation and communication overhead. Parallax further optimizes training with local aggregation and smart operation placement to mitigate communication overhead. Graph transformation in Parallax automatically applies all of these optimizations and the data parallel training itself at the framework level to minimize user efforts for writing and optimizing a distributed program by composing low-level primitives.",
  "title": null,
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "VocGAN",
  "full_name": "VocGAN",
  "description": "Please enter a description about the method here",
  "title": "VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "VIME",
  "full_name": "Value Imputation and Mask Estimation",
  "description": "**VIME **, or **Value Imputation and Mask Estimation**, is a self- and semi-supervised learning framework for tabular data. It consists of a pretext task of estimating mask vectors from corrupted tabular data in addition to the reconstruction pretext task for self-supervised learning.",
  "title": "VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "CNN BiLSTM",
  "full_name": "CNN Bidirectional LSTM",
  "description": "A **CNN BiLSTM** is a hybrid bidirectional [LSTM](https://paperswithcode.com/method/lstm) and CNN architecture. In the original formulation applied to named entity recognition, it learns both character-level and word-level features. The CNN component is used to induce the character-level features. For each word the model employs a [convolution](https://paperswithcode.com/method/convolution) and a [max pooling](https://paperswithcode.com/method/max-pooling) layer to extract a new feature vector from the per-character feature vectors such as character embeddings and (optionally) character type.",
  "title": "Named Entity Recognition with Bidirectional LSTM-CNNs",
  "collection": "Bidirectional Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "ESIM",
  "full_name": "Enhanced Sequential Inference Model",
  "description": "**Enhanced Sequential Inference Model** or **ESIM** is a sequential NLI model proposed in [Enhanced LSTM for Natural Language Inference](https://www.aclweb.org/anthology/P17-1152) paper.",
  "title": "Enhanced LSTM for Natural Language Inference",
  "collection": "Sequence To Sequence Models",
  "area": "Sequential"
}
{
  "name": "YOLOP",
  "full_name": "YOLOP",
  "description": "**YOLOP** is a panoptic driving perception network for handling traffic object detection, drivable area segmentation and lane detection simultaneously. It is composed of one encoder for feature extraction and three decoders to handle the specific tasks. It can be thought of a lightweight version of Tesla's HydraNet model for self-driving cars.\r\n\r\nA lightweight CNN, from Scaled-yolov4, is used as the encoder to extract features from the image. Then these feature maps are fed to three decoders to complete their respective tasks. The detection decoder is based on the current best-performing single-stage detection network, [YOLOv4](https://paperswithcode.com/method/yolov4),  for two main reasons: (1) The single-stage detection network is faster than the two-stage detection network. (2) The grid-based prediction mechanism of the single-stage detector is more related to the other two semantic segmentation tasks, while instance segmentation is usually combined with the region based detector as in [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn). The feature map output by the encoder incorporates semantic features of different levels and scales, and our segmentation branch can use these feature maps to complete pixel-wise semantic prediction.",
  "title": "YOLOP: You Only Look Once for Panoptic Driving Perception",
  "collection": "One-Stage Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "PinvGCN",
  "full_name": "Pseudoinverse Graph Convolutional Network",
  "description": "A [GCN](https://paperswithcode.com/method/gcn) method targeted at the unique spectral properties of dense graphs and hypergraphs, enabled by efficient numerical linear algebra.",
  "title": "Pseudoinverse Graph Convolutional Networks: Fast Filters Tailored for Large Eigengaps of Dense Graphs and Hypergraphs",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "Demon CM",
  "full_name": "Demon CM",
  "description": "**Demon CM**, or **SGD with Momentum and Demon**,  is the [Demon](https://paperswithcode.com/method/demon) momentum rule applied to [SGD with momentum](https://paperswithcode.com/method/sgd-with-momentum).\r\n\r\n$$ \\beta\\_{t} = \\beta\\_{init}\\cdot\\frac{\\left(1-\\frac{t}{T}\\right)}{\\left(1-\\beta\\_{init}\\right) + \\beta\\_{init}\\left(1-\\frac{t}{T}\\right)} $$\r\n\r\n$$ \\theta\\_{t+1} = \\theta\\_{t} - \\eta{g}\\_{t} + \\beta\\_{t}v\\_{t} $$\r\n\r\n$$ v\\_{t+1} = \\beta\\_{t}{v\\_{t}} - \\eta{g\\_{t}} $$",
  "title": "Demon: Improved Neural Network Training with Momentum Decay",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Unified VLP",
  "full_name": "Unified VLP",
  "description": "Unified VLP is unified encoder-decoder model for general vision-language pre-training. The models uses a shared multi-layer transformers network for both encoding and decoding. The model is pre-trained on large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. Model architecture for pre-training. For pre-training , the input comprises of image input, sentence input, and three special tokens ([CLS], [SEP], [STOP]). The image is processed as $N$ Region of Interests (RoIs) and region features are extracted. The sentence is tokenized and masked with [MASK] tokens for the later masked language modeling task. The model consists of 12 layers of Transformer blocks, each having a masked self-attention layer and feed-forward module, where the self-attention mask controls what input context the prediction conditions on. Two self-attention masks are implemented depending on whether the objective is bidirectional or seq2seq. The model is fine-tuned for image captioning and visual question answering.",
  "title": "Unified Vision-Language Pre-Training for Image Captioning and VQA",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "MoCo v2",
  "full_name": "MoCo v2",
  "description": "**MoCo v2** is an improved version of the [Momentum Contrast](https://paperswithcode.com/method/moco) self-supervised learning algorithm. Motivated by the findings presented in the [SimCLR](https://paperswithcode.com/method/simclr) paper, authors:\r\n\r\n- Replace the 1-layer fully connected layer with a 2-layer MLP head with [ReLU](https://paperswithcode.com/method/relu) for the unsupervised training stage.\r\n- Include blur augmentation.\r\n- Use cosine learning rate schedule.\r\n\r\nThese modifications enable MoCo to outperform the state-of-the-art SimCLR with a smaller batch size and fewer epochs.",
  "title": "Improved Baselines with Momentum Contrastive Learning",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Flow Matching",
  "full_name": "Conditional / Rectified flow matching",
  "description": "Conditional Flow Matching (CFM) is a fast way to train continuous normalizing flow (CNF) models. CFM is a simulation-free training objective for continuous normalizing flows that allows conditional generative modelling and speeds up training and inference.",
  "title": "Flow Matching for Generative Modeling",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "GIN",
  "full_name": "Graph Isomorphism Network",
  "description": "Per the authors, Graph Isomorphism Network (GIN) generalizes the WL test and hence achieves maximum discriminative power among GNNs.",
  "title": "How Powerful are Graph Neural Networks?",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "PyramidNet",
  "full_name": "PyramidNet",
  "description": "A **PyramidNet** is a type of convolutional network where the key idea is to concentrate on the feature map dimension by increasing it gradually instead of by increasing it sharply at each residual unit with downsampling. In addition, the network architecture works as a mixture of both plain and residual networks by using zero-padded identity-mapping shortcut connections when increasing the feature map dimension.",
  "title": "Deep Pyramidal Residual Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DenseNAS-C",
  "full_name": "DenseNAS-C",
  "description": "**DenseNAS-C** is a mobile convolutional neural network discovered through the [DenseNAS](https://paperswithcode.com/method/densenas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building block is MBConvs, or inverted bottleneck residuals, from the [MobileNet](https://paperswithcode.com/method/mobilenetv2) architectures.",
  "title": "Densely Connected Search Space for More Flexible Neural Architecture Search",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Fragmentation",
  "full_name": "Fragmentation",
  "description": "Given a pattern $P,$ that is more complicated than the patterns, we fragment $P$ into simpler patterns such that their exact count is known. In the subgraph GNN proposed earlier, look into the subgraph of the host graph. We have seen that this technique is scalable on large graphs. Also, we have seen that subgraph GNN is more expressive and efficient than traditional GNN. So, we tried to explore the expressibility when the pattern is fragmented into smaller subpatterns.",
  "title": "Improving Expressivity of Graph Neural Networks using Localization",
  "collection": "Localization Models",
  "area": "Computer Vision"
}
{
  "name": "Selective Search",
  "full_name": "Selective Search",
  "description": "**Selective Search** is a region proposal algorithm for object detection tasks. It starts by over-segmenting the image based on intensity of the pixels using a graph-based segmentation method by Felzenszwalb and Huttenlocher. Selective Search then takes these oversegments as initial input and performs the following steps\r\n\r\n1. Add all bounding boxes corresponding to segmented parts to the list of regional proposals\r\n2. Group adjacent segments based on similarity\r\n3. Go to step 1\r\n\r\nAt each iteration, larger segments are formed and added to the list of region proposals. Hence we create region proposals from smaller segments to larger segments in a bottom-up approach. This is what we mean by computing “hierarchical” segmentations using Felzenszwalb and Huttenlocher’s oversegments.",
  "title": null,
  "collection": "Region Proposal",
  "area": "Computer Vision"
}
{
  "name": "Point-wise Spatial Attention",
  "full_name": "Point-wise Spatial Attention",
  "description": "**Point-wise Spatial Attention (PSA)** is a [semantic segmentation](https://paperswithcode.com/task/semantic-segmentation) module. The goal is capture contextual information, especially in the long range, by aggregating information. Through the PSA module, information aggregation is performed as a kind of information flow where we adaptively learn a pixel-wise global attention map for each position from two perspectives to aggregate contextual information over the entire feature map.\r\n\r\nThe PSA module takes a spatial feature map $\\mathbf{X}$ as input. We denote the spatial size of $\\mathbf{X}$ as $H \\times W$. Through the two branches as illustrated, we generate pixel-wise global attention maps for each position in feature map $\\mathbf{X}$ through several convolutional layers.\r\n\r\nWe aggregate input feature maps based on attention maps to generate new feature representations with the long-range contextual information incorporated, i.e., $\\mathbf{Z}\\_{c}$ from the ‘collect’ branch and $\\mathbf{Z}\\_{d}$ from the ‘distribute’ branch.\r\n\r\nWe concatenate the new representations $\\mathbf{Z}\\_{c}$ and $\\mathbf{Z}\\_{d}$ and apply a convolutional layer with [batch normalization](https://paperswithcode.com/method/batch-normalization) and activation layers for dimension reduction and feature fusion. Then we concatenate the new global contextual feature with the local representation feature $\\mathbf{X}$. It is followed by one or several convolutional layers with batch normalization and activation layers to generate the final feature map for following subnetworks.",
  "title": "PSANet: Point-wise Spatial Attention Network for Scene Parsing",
  "collection": "Semantic Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "Swish",
  "full_name": "Swish",
  "description": "**Swish** is an activation function, $f(x) = x \\cdot \\text{sigmoid}(\\beta x)$, where $\\beta$ a learnable parameter. Nearly all implementations do not use the learnable parameter $\\beta$, in which case the activation function is $x\\sigma(x)$ (\"Swish-1\").\r\n\r\nThe function $x\\sigma(x)$ is exactly the [SiLU](https://paperswithcode.com/method/silu), which was introduced by other authors before the swish.\r\nSee [Gaussian Error Linear Units](https://arxiv.org/abs/1606.08415) ([GELUs](https://paperswithcode.com/method/gelu)) where the SiLU (Sigmoid Linear Unit) was originally coined, and see [Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning](https://arxiv.org/abs/1702.03118) and [Swish: a Self-Gated Activation Function](https://arxiv.org/abs/1710.05941v1) where the same activation function was experimented with later.",
  "title": "Searching for Activation Functions",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "EVM",
  "full_name": "Extreme Value Machine",
  "description": "",
  "title": "The Extreme Value Machine",
  "collection": "Output Functions",
  "area": "General"
}
{
  "name": "SchNet",
  "full_name": "Schrödinger Network",
  "description": "**SchNet** is an end-to-end deep neural network architecture based on continuous-filter convolutions. It follows the deep tensor neural network framework, i.e. atom-wise representations are constructed by starting from embedding vectors that characterize the atom type before introducing the configuration of the system by a series of interaction blocks.",
  "title": "SchNet: A continuous-filter convolutional neural network for modeling quantum interactions",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Discriminative Fine-Tuning",
  "full_name": "Discriminative Fine-Tuning",
  "description": "**Discriminative Fine-Tuning** is a fine-tuning strategy that is used for [ULMFiT](https://paperswithcode.com/method/ulmfit) type models. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with different learning rates. For context, the regular stochastic gradient descent ([SGD](https://paperswithcode.com/method/sgd)) update of a model’s parameters $\\theta$ at time step $t$ looks like the following (Ruder, 2016):\r\n\r\n$$ \\theta\\_{t} = \\theta\\_{t-1} − \\eta\\cdot\\nabla\\_{\\theta}J\\left(\\theta\\right)$$\r\n\r\nwhere $\\eta$ is the learning rate and $\\nabla\\_{\\theta}J\\left(\\theta\\right)$ is the gradient with regard to the model’s objective function. For discriminative fine-tuning, we split the parameters $\\theta$ into {$\\theta\\_{1}, \\ldots, \\theta\\_{L}$} where $\\theta\\_{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain {$\\eta\\_{1}, \\ldots, \\eta\\_{L}$} where $\\theta\\_{l}$ where $\\eta\\_{l}$ is the learning rate of the $l$-th layer. The SGD update with discriminative finetuning is then:\r\n\r\n$$ \\theta\\_{t}^{l} = \\theta\\_{t-1}^{l} - \\eta^{l}\\cdot\\nabla\\_{\\theta^{l}}J\\left(\\theta\\right) $$\r\n\r\nThe authors find that empirically it worked well to first choose the learning rate $\\eta^{L}$ of the last layer by fine-tuning only the last layer and using $\\eta^{l-1}=\\eta^{l}/2.6$ as the learning rate for lower layers.",
  "title": "Universal Language Model Fine-tuning for Text Classification",
  "collection": "Fine-Tuning",
  "area": "General"
}
{
  "name": "Relativistic GAN",
  "full_name": "Relativistic GAN",
  "description": "A **Relativistic GAN** is a type of generative adversarial network. It has a relativistic discriminator which estimates the probability that the given real data is more realistic than a randomly sampled fake data. The idea is to endow GANs with the property that the probability of real data being real ($D\\left(x\\_{r}\\right)$) should decrease as the probability of fake data being real ($D\\left(x\\_{f}\\right)$) increases.\r\n\r\nWith a standard [GAN](https://paperswithcode.com/method/gan), we can achieve this as follows. The standard GAN discriminator can be defined, in term of the non-transformed layer $C\\left(x\\right)$, as $D\\left(x\\right) = \\text{sigmoid}\\left(C\\left(x\\right)\\right)$. A simple way to make discriminator relativistic - having the output of $D$ depend on both real and fake data - is to sample from real/fake data pairs $\\tilde{x} = \\left(x\\_{r}, x\\_{f}\\right)$ and define it as $D\\left(\\tilde{x}\\right) = \\text{sigmoid}\\left(C\\left(x\\_{r}\\right) − C\\left(x\\_{f}\\right)\\right)$. The modification can be interpreted as: the discriminator estimates the probability\r\nthat the given real data is more realistic than a randomly sampled fake data.\r\n\r\nMore generally a Relativistic GAN can be interpreted as having a discriminator of the form $a\\left(C\\left(x\\_{r}\\right)−C\\left(x\\_{f}\\right)\\right)$, where $a$ is the activation function, to be relativistic.",
  "title": "The relativistic discriminator: a key element missing from standard GAN",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "Dynamic Memory Network",
  "full_name": "Dynamic Memory Network",
  "description": "A **Dynamic Memory Network** is a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. \r\n\r\nThe DMN consists of a number of modules:\r\n\r\n- Input Module: The input module encodes raw text inputs from the task into distributed vector representations. The input takes forms like a sentence, a long story, a movie review and so on.\r\n- Question Module: The question module encodes the question of the task into a distributed\r\nvector representation. For question answering, the question may be a sentence such as \"Where did the author first fly?\". The representation is fed into the episodic memory module, and forms the basis, or initial state, upon which the episodic memory module iterates.\r\n- Episodic Memory Module: Given a collection of input representations, the episodic memory module chooses which parts of the inputs to focus on through the attention mechanism. It then produces a ”memory” vector representation taking into account the question as well as the previous memory. Each iteration provides the module with newly relevant information about the input. In other words,\r\nthe module has the ability to retrieve new information, in the form of input representations, which were thought to be irrelevant in previous iterations.\r\n- Answer Module: The answer module generates an answer from the final memory vector of the memory module.",
  "title": "Ask Me Anything: Dynamic Memory Networks for Natural Language Processing",
  "collection": "Working Memory Models",
  "area": "General"
}
{
  "name": "Style-based Recalibration Module",
  "full_name": "Style-based Recalibration Module",
  "description": "A **Style-based Recalibration Module (SRM)** is a module for convolutional neural networks that adaptively recalibrates intermediate feature maps by exploiting their styles. SRM first extracts the style information from each channel of the feature maps by style pooling, then estimates per-channel recalibration weight via channel-independent style integration. By incorporating the relative importance of individual styles into feature maps, SRM is aimed at enhancing the representational ability of a CNN.\r\n\r\nThe overall structure of SRM is illustrated in the Figure to the right. It consists of two main components: style pooling and style integration. The style pooling operator extracts style features\r\nfrom each channel by summarizing feature responses across spatial dimensions. It is followed by the style integration operator, which produces example-specific style weights by utilizing the style features via channel-wise operation. The style weights finally recalibrate the feature maps to either\r\nemphasize or suppress their information.",
  "title": "SRM : A Style-based Recalibration Module for Convolutional Neural Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "PALED",
  "full_name": "Segmentation of patchy areas in biomedical images based on local edge density estimation",
  "description": "An effective approach to the quantification of patchiness in biomedical images according to their local edge densities.",
  "title": "Segmentation of patchy areas in biomedical images based on local edge density estimation",
  "collection": "Image Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "DEQ",
  "full_name": "Deep Equilibrium Models",
  "description": "A new kind of implicit models, where the output of the network is defined as the solution to an \"infinite-level\" fixed point equation. Thanks to this we can compute the gradient of the output without activations and therefore with a significantly reduced memory footprint.",
  "title": "Deep Equilibrium Models",
  "collection": "Robust Training",
  "area": "General"
}
{
  "name": "CurvVAE",
  "full_name": "Curvature Regularized Variational Auto-Encoder",
  "description": "",
  "title": "Learning from Demonstration using a Curvature Regularized Variational Auto-Encoder (CurvVAE)",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "XLNet",
  "full_name": "XLNet",
  "description": "**XLNet** is an autoregressive [Transformer](https://paperswithcode.com/method/transformer) that leverages the best of both autoregressive language modeling and autoencoding while attempting to avoid their limitations. Instead of using a fixed forward or backward factorization order as in conventional autoregressive models, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context.\r\n\r\nAdditionally, inspired by the latest advancements in autogressive language modeling, XLNet integrates the segment recurrence mechanism and relative encoding scheme of [Transformer-XL](https://paperswithcode.com/method/transformer-xl) into pretraining, which empirically improves the performance especially for tasks involving a longer text sequence.",
  "title": "XLNet: Generalized Autoregressive Pretraining for Language Understanding",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Blink Communication",
  "full_name": "Blink Communication",
  "description": "**Blink** is a communication library for inter-GPU parameter exchange that achieves near-optimal link utilization. To handle topology heterogeneity from hardware generations or partial allocations from cluster schedulers, Blink dynamically generates optimal communication primitives for a given topology. Blink probes the set of links available for a given job at runtime and builds a topology with appropriate link capacities. Given the topology, Blink achieves the optimal communication rate by packing spanning trees, that can utilize more links (Lovasz, 1976; Edmonds, 1973) when compared to rings. The authors use a multiplicative-weight update based approximation algorithm to quickly compute the maximal packing and extend the algorithm to further minimize the number of trees generated. Blink’s collectives extend across multiple machines effectively utilizing all available network interfaces.",
  "title": "Blink: Fast and Generic Collectives for Distributed ML",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "AEDA",
  "full_name": "An Easier Data Augmentation",
  "description": "**AEDA**, or **An Easier Data Augmentation**, is a type of data augmentation technique for text classification which includes only the insertion of various punctuation marks into the input sequence. AEDA preserves all the input information and does not mislead the network since it keeps the word order intact while changing their positions in that the words are shifted to the right.",
  "title": "AEDA: An Easier Data Augmentation Technique for Text Classification",
  "collection": "Text Data Augmentation",
  "area": "Natural Language Processing"
}
{
  "name": "Random Resized Crop",
  "full_name": "Random Resized Crop",
  "description": "**RandomResizedCrop** is a type of image data augmentation where a crop of random size of the original size and a random aspect ratio of the original aspect ratio is made. This crop is finally resized to given size.\r\n\r\nImage Credit: [Apache MXNet](https://mxnet.apache.org/versions/1.5.0/tutorials/gluon/data_augmentation.html)",
  "title": null,
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Sparse Autoencoder",
  "full_name": "Sparse Autoencoder",
  "description": "A **Sparse Autoencoder** is a type of autoencoder that employs sparsity to achieve an information bottleneck. Specifically the loss function is constructed so that activations are penalized within a layer. The sparsity constraint can be imposed with [L1 regularization](https://paperswithcode.com/method/l1-regularization) or a KL divergence between expected average neuron activation to an ideal distribution $p$.\r\n\r\nImage: [Jeff Jordan](https://www.jeremyjordan.me/autoencoders/). Read his blog post (click) for a detailed summary of autoencoders.",
  "title": null,
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Chinese Pre-trained Unbalanced Transformer",
  "full_name": "Chinese Pre-trained Unbalanced Transformer",
  "description": "**CPT**, or **Chinese Pre-trained Unbalanced Transformer**, is a pre-trained unbalanced [Transformer](https://paperswithcode.com/method/transformer) for Chinese natural language understanding (NLU) and natural language generation (NLG) tasks. CPT consists of three parts: a shared encoder, an understanding decoder, and a generation decoder. Two specific decoders with a shared encoder are pre-trained with masked language modeling (MLM) and denoising auto-encoding (DAE) tasks, respectively. With the partially shared architecture and multi-task pre-training, CPT can (1) learn specific knowledge of both NLU or NLG tasks with two decoders and (2) be fine-tuned flexibly that fully exploits the potential of the model. Two specific decoders with a shared encoder are pre-trained with masked language modeling (MLM) and denoising auto-encoding (DAE) tasks, respectively. With the partially shared architecture and multi-task pre-training, CPT can (1) learn specific knowledge of both NLU or NLG tasks with two decoders and (2) be fine-tuned flexibly that fully exploits the potential of the model.",
  "title": "CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Symbolic Deep Learning",
  "full_name": "Symbolic Deep Learning",
  "description": "This is a general approach to convert a neural network into an analytic equation. The technique works as follows:\r\n\r\n1. Encourage sparse latent representations\r\n2. Apply symbolic regression to approximate the transformations between in/latent/out layers\r\n3. Compose the symbolic expressions.\r\n\r\nIn the [paper](https://arxiv.org/abs/2006.11287), we show that we find the correct known equations, including force laws and Hamiltonians, can be extracted from the neural network. We then apply our method to a non-trivial cosmology example-a detailed dark matter simulation-and discover a new analytic formula which can predict the concentration of dark matter from the mass distribution of nearby cosmic structures. The symbolic expressions extracted from the GNN using our technique also generalized to out-of-distribution data better than the GNN itself. Our approach offers alternative directions for interpreting neural networks and discovering novel physical principles from the representations they learn.",
  "title": "Discovering Symbolic Models from Deep Learning with Inductive Biases",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "Online Normalization",
  "full_name": "Online Normalization",
  "description": "**Online Normalization** is a normalization technique for training deep neural networks. To define Online Normalization. we replace arithmetic averages over the full dataset in with exponentially decaying averages of online samples. The decay factors $\\alpha\\_{f}$ and $\\alpha\\_{b}$ for forward and backward passes respectively are hyperparameters for the technique.\r\n\r\nWe allow incoming samples $x\\_{t}$, such as images, to have multiple scalar components and denote\r\nfeature-wide mean and variance by $\\mu\\left(x\\_{t}\\right)$ and $\\sigma^{2}\\left(x\\_{t}\\right)$. The algorithm also applies to outputs of fully connected layers with only one scalar output per feature. In fact, this case simplifies to $\\mu\\left(x\\_{t}\\right) = x\\_{t}$ and $\\sigma\\left(x\\_{t}\\right) = 0$. Denote scalars $\\mu\\_{t}$ and $\\sigma\\_{t}$ to denote running estimates of mean and variance across\r\nall samples. The subscript $t$ denotes time steps corresponding to processing new incoming samples.\r\n\r\nOnline Normalization uses an ongoing process during the forward pass to estimate activation means\r\nand variances. It implements the standard online computation of mean and variance generalized to processing multi-value samples and exponential averaging of sample statistics. The\r\nresulting estimates directly lead to an affine normalization transform.\r\n\r\n$$ y\\_{t} = \\frac{x\\_{t} - \\mu\\_{t-1}}{\\sigma\\_{t-1}} $$ \r\n\r\n$$ \\mu\\_{t} = \\alpha\\_{f}\\mu\\_{t-1} + \\left(1-\\alpha\\_{f}\\right)\\mu\\left(x\\_{t}\\right) $$\r\n\r\n$$ \\sigma^{2}\\_{t} = \\alpha\\_{f}\\sigma^{2}\\_{t-1} + \\left(1-\\alpha\\_{f}\\right)\\sigma^{2}\\left(x\\_{t}\\right) + \\alpha\\_{f}\\left(1-\\alpha\\_{f}\\right)\\left(\\mu\\left(x\\_{t}\\right) - \\mu\\_{t-1}\\right)^{2} $$",
  "title": "Online Normalization for Training Neural Networks",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Shake-Shake Regularization",
  "full_name": "Shake-Shake Regularization",
  "description": "**Shake-Shake Regularization**  aims to improve the generalization ability of multi-branch networks by replacing the standard summation of parallel branches with a stochastic affine combination. A typical pre-activation [ResNet](https://paperswithcode.com/method/resnet) with 2 residual branches would follow this equation:\r\n\r\n$$x\\_{i+1} = x\\_{i} + \\mathcal{F}\\left(x\\_{i}, \\mathcal{W}\\_{i}^{\\left(1\\right)}\\right) + \\mathcal{F}\\left(x\\_{i}, \\mathcal{W}\\_{i}^{\\left(2\\right)}\\right) $$\r\n\r\nShake-shake regularization introduces a random variable $\\alpha\\_{i}$  following a uniform distribution between 0 and 1 during training:\r\n\r\n$$x\\_{i+1} = x\\_{i} + \\alpha\\mathcal{F}\\left(x\\_{i}, \\mathcal{W}\\_{i}^{\\left(1\\right)}\\right) + \\left(1-\\alpha\\right)\\mathcal{F}\\left(x\\_{i}, \\mathcal{W}\\_{i}^{\\left(2\\right)}\\right) $$\r\n\r\nFollowing the same logic as for [dropout](https://paperswithcode.com/method/dropout), all $\\alpha\\_{i}$ are set to the expected value of $0.5$ at test time.",
  "title": "Shake-Shake regularization",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "MAVL",
  "full_name": "Multiscale Attention ViT with Late fusion",
  "description": "Multiscale Attention ViT with Late fusion (MAVL) is a multi-modal network, trained with aligned image-text pairs, capable of performing targeted detection using human understandable natural language text queries. It utilizes multi-scale image features and uses deformable convolutions with late multi-modal fusion. The authors demonstrate excellent ability of MAVL as class-agnostic object detector when queried using general human understandable natural language command, such as \"all objects\", \"all entities\", etc.",
  "title": "Class-agnostic Object Detection with Multi-modal Transformer",
  "collection": "Multi-Modal Methods",
  "area": "Computer Vision"
}
{
  "name": "gCANS",
  "full_name": "Global Coupled Adaptive Number of Shots",
  "description": "**gCANS**, or **Global Coupled Adaptive Number of Shots**, is a variational quantum algorithm for stochastic gradient descent. It adaptively allocates shots for the measurement of each gradient component at each iteration. The optimizer uses a criterion for allocating shots that incorporates information about the overall scale of the shot cost for the iteration.",
  "title": "Adaptive shot allocation for fast convergence in variational quantum algorithms",
  "collection": "Quantum Methods",
  "area": "General"
}
{
  "name": "VisualBERT",
  "full_name": "VisualBERT",
  "description": "VisualBERT aims to reuse self-attention to implicitly align elements of the input text and regions in the input image. Visual embeddings are used to model images where the representations are represented by a bounding region in an image obtained from an object detector. These visual embeddings are constructed by summing three embeddings: 1) visual feature representation, 2) a segment embedding indicate whether it is an image embedding, and 3) position embedding. Essentially, image regions and language are combined with a Transformer to allow self-attention to discover implicit alignments between language and vision. VisualBERT is trained using COCO, which consists of images paired with captions. It is pre-trained using two objectives: masked language modeling objective and sentence-image prediction task. It can then be fine-tuned on different downstream tasks.",
  "title": "VisualBERT: A Simple and Performant Baseline for Vision and Language",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "ALDA",
  "full_name": "ALDA",
  "description": "**Adversarial-Learned Loss for Domain Adaptation** is a method for domain adaptation that combines adversarial learning with self-training. Specifically, the domain discriminator has to produce different corrected labels for different domains, while the feature generator aims to confuse the domain discriminator. The adversarial process finally leads to a proper confusion matrix on the target domain. In this way, ALDA takes the strengths of domain-adversarial learning and self-training based methods.",
  "title": "Adversarial-Learned Loss for Domain Adaptation",
  "collection": "Unpaired Image-to-Image Translation",
  "area": "Computer Vision"
}
{
  "name": "SymmNet",
  "full_name": "Domain-Symmetric Network",
  "description": "**Domain-Symmetric Network**, or **SymmNet**, is an algorithm for unsupervised multi-class domain adaptation. It features an adversarial strategy of domain confusion and discrimination.",
  "title": "Unsupervised Multi-Class Domain Adaptation: Theory, Algorithms, and Practice",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "BTmPG",
  "full_name": "BTmPG",
  "description": "**BTmPG**, or **Back-Translation guided multi-round Paraphrase Generation**, is a multi-round paraphrase generation method that leverages back-translation to guide paraphrase model during training and generates paraphrases in a multiround process. The model regards paraphrase generation as a monolingual translation task. Given a paraphrase pair $\\left(S\\_{0}, P\\right)$, which $S\\_{0}$ is the original/source sentence and $P$ is the target paraphrase given in the dataset. In the first round generation, we send $S\\_{0}$ into a paraphrase model to generate a paraphrase $S\\_{1}$. In the second round generation, we use the $S\\_{1}$ as the input of the model to generate a new paraphrase $S\\_{2}$. And so forth, in the $i$-th round generation, we send $S\\_{i−1}$ into the paraphrase model to generate $S\\_{i}$.\r\n.",
  "title": "Pushing Paraphrase Away from Original Sentence: A Multi-Round Paraphrase Generation Approach",
  "collection": "Paraphrase Generation Models",
  "area": "Natural Language Processing"
}
{
  "name": "TURL",
  "full_name": "TURL: Table Understanding through Representation Learning",
  "description": "Relational tables on the Web store a vast amount of knowledge. Owing to the wealth of such tables, there has been tremendous progress on a variety of tasks in the area of table understanding. However, existing work generally relies on heavily-engineered task- specific features and model architectures. In this paper, we present TURL, a novel framework that introduces the pre-training/fine- tuning paradigm to relational Web tables. During pre-training, our framework learns deep contextualized representations on relational tables in an unsupervised manner. Its universal model design with pre-trained representations can be applied to a wide range of tasks with minimal task-specific fine-tuning.\r\nSpecifically, we propose a structure-aware Transformer encoder to model the row-column structure of relational tables, and present a new Masked Entity Recovery (MER) objective for pre-training to capture the semantics and knowledge in large-scale unlabeled data. We systematically evaluate TURL with a benchmark consisting of 6 different tasks for table understanding (e.g., relation extraction, cell filling). We show that TURL generalizes well to all tasks and substantially outperforms existing methods in almost all instances.",
  "title": "TURL: Table Understanding through Representation Learning",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "BP-Transformer",
  "full_name": "BP-Transformer",
  "description": "The **BP-Transformer (BPT)** is a type of [Transformer](https://paperswithcode.com/method/transformer) that is motivated by the need to find a better balance between capability and computational complexity for self-attention. The architecture partitions the input sequence into different multi-scale spans via binary partitioning (BP). It incorporates an inductive bias of attending the context information from fine-grain to coarse-grain as the relative distance increases. The farther the context information is, the coarser its representation is.\r\nBPT can be regard as graph neural network, whose nodes are the multi-scale spans. A token node can attend the smaller-scale span for the closer context and the larger-scale span for the longer distance context. The representations of nodes are updated with [Graph Self-Attention](https://paperswithcode.com/method/graph-self-attention).",
  "title": "BP-Transformer: Modelling Long-Range Context via Binary Partitioning",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "STAC",
  "full_name": "STAC",
  "description": "**STAC** is a semi-supervised framework for visual object detection along with a data augmentation strategy. STAC deploys highly confident pseudo labels of localized objects from an unlabeled image and updates the model by enforcing consistency via strong augmentations. We generate pseudo labels (i.e., bounding boxes and their class labels) for unlabeled data using test-time inference, including NMS , of the teacher model trained with labeled data. We then compute unsupervised loss with respect to pseudo labels whose confidence scores are above a threshold $\\tau$ . The strong augmentations are applied for augmentation consistency during the model training. Target boxes are augmented when global geometric transformations are used.",
  "title": "A Simple Semi-Supervised Learning Framework for Object Detection",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "CBAM",
  "full_name": "Convolutional Block Attention Module",
  "description": "**Convolutional Block Attention Module (CBAM)** is an attention module for convolutional neural networks. Given an intermediate feature map, the module sequentially infers attention maps along two separate dimensions, channel and spatial, then the attention maps are multiplied to the input feature map for adaptive feature refinement.\r\n\r\nGiven an intermediate feature map $\\mathbf{F} \\in \\mathbb{R}^{C×H×W}$ as input, CBAM sequentially infers a 1D channel attention map $\\mathbf{M}\\_{c} \\in \\mathbb{R}^{C×1×1}$ and a 2D spatial attention map $\\mathbf{M}\\_{s} \\in \\mathbb{R}^{1×H×W}$. The overall attention process can be summarized as:\r\n\r\n$$ \\mathbf{F}' = \\mathbf{M}\\_{c}\\left(\\mathbf{F}\\right) \\otimes \\mathbf{F} $$\r\n\r\n$$ \\mathbf{F}'' = \\mathbf{M}\\_{s}\\left(\\mathbf{F'}\\right) \\otimes \\mathbf{F'} $$\r\n\r\nDuring multiplication, the attention values are broadcasted (copied) accordingly: channel attention values are broadcasted along the spatial dimension, and vice versa. $\\mathbf{F}''$ is the final refined\r\noutput.",
  "title": "CBAM: Convolutional Block Attention Module",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Hierarchical MTL",
  "full_name": "Hierarchical Multi-Task Learning",
  "description": "Multi-task learning (MTL) introduces an inductive bias, based on a-priori relations between tasks: the trainable model is compelled to model more general dependencies by using the abovementioned relation as an important data feature. Hierarchical MTL, in which different tasks use different levels of the deep neural network, provides more effective inductive bias compared to “flat” MTL. Also, hierarchical MTL helps to solve the vanishing gradient problem in deep learning.",
  "title": "Deep multi-task learning with low level tasks supervised at lower layers",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "AttLWB",
  "full_name": "Attentional Liquid Warping Block",
  "description": "**Attentional Liquid Warping Block**, or **AttLWB**, is a module for human image synthesis GANs that propagates the source information - such as texture, style, color and face identity - in both image and feature spaces to the synthesized reference. It firstly learns similarities of the global features among all multiple sources features, and then it fuses the multiple sources features by a linear combination of the learned similarities and the multiple sources in the feature spaces. Finally, to better propagate the source identity (style, color, and texture) into the global stream, the fused source features are warped to the global stream by [Spatially-Adaptive Normalization](https://paperswithcode.com/method/spade) (SPADE).",
  "title": "Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Early Stopping",
  "full_name": "Early Stopping",
  "description": "**Early Stopping** is a regularization technique for deep neural networks that stops training when parameter updates no longer begin to yield improves on a validation set. In essence, we store and update the current best parameters during training, and when parameter updates no longer yield an improvement (after a set number of iterations) we stop training and use the last best parameters. It works as a regularizer by restricting the optimization procedure to a smaller volume of parameter space.\r\n\r\nImage Source: [Ramazan Gençay](https://www.researchgate.net/figure/Early-stopping-based-on-cross-validation_fig1_3302948)",
  "title": null,
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "FIERCE",
  "full_name": "Feature Information Entropy Regularized Cross Entropy",
  "description": "FIERCE is an entropic regularization on the **feature** space",
  "title": "Preserving Fine-Grain Feature Information in Classification via Entropic Regularization",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Local Patch Interaction",
  "full_name": "Local Patch Interaction",
  "description": "**Local Patch Interaction**, or **LPI**, is a module used for the [XCiT layer](https://paperswithcode.com/method/xcit-layer) to enable explicit communication across patches. LPI consists of two [depth-wise 3×3 convolutional layers](https://paperswithcode.com/method/depthwise-convolution) with [Batch Normalization](https://paperswithcode.com/method/batch-normalization) and [GELU](https://paperswithcode.com/method/gelu) non-linearity in between. Due to its depth-wise structure, the LPI block has a negligible overhead in terms of parameters, as well as a limited overhead in terms of throughput and memory usage during inference.",
  "title": "XCiT: Cross-Covariance Image Transformers",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Pyramidal Residual Unit",
  "full_name": "Pyramidal Residual Unit",
  "description": "A **Pyramidal Residual Unit** is a type of residual unit where the number of channels gradually increases as a function of the depth at which the layer occurs, which is similar to a pyramid structure of which the shape gradually widens from the top downwards. It was introduced as part of the [PyramidNet](https://paperswithcode.com/method/pyramidnet) architecture.",
  "title": "Deep Pyramidal Residual Networks",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "DExTra",
  "full_name": "DExTra",
  "description": "**DExTra**, or **Deep and Light-weight Expand-reduce Transformation**, is a light-weight expand-reduce transformation that enables learning wider representations efficiently.\r\n\r\nDExTra maps a $d\\_{m}$ dimensional input vector into a high dimensional space (expansion) and then\r\nreduces it down to a $d\\_{o}$ dimensional output vector (reduction) using $N$ layers of group transformations. During these expansion and reduction phases, DExTra uses group linear transformations because they learn local representations by deriving the output from a specific part of the input and are more efficient than linear transformations. To learn global representations, DExTra shares information between different groups in the group linear transformation using feature shuffling \r\n\r\nFormally, the DExTra transformation is controlled by five configuration parameters: (1) depth $N$, (2)\r\nwidth multiplier $m\\_{w}$, (3) input dimension $d\\_{m}$, (4) output dimension $d\\_{o}$, and (5) maximum groups $g\\_{max}$ in a group linear transformation. In the expansion phase, DExTra projects the $d\\_{m}$-dimensional input to a high-dimensional space, $d\\_{max} = m\\_{w}d\\_{m}$, linearly using $\\text{ceil}\\left(\\frac{N}{2}\\right)$ layers. In the reduction phase, DExTra projects the $d\\_{max}$-dimensional vector to a $d\\_{o}$-dimensional space using the remaining $N -\\text{ceil}\\left(\\frac{N}{2}\\right)$ layers. 
Mathematically, we define the output $\\mathbf{Y}^{l}$ at each layer $l$ as:\r\n\r\n$$ \\mathbf{Y}^{l} = \\mathcal{F}\\left(\\mathbf{X}, \\mathbf{W}^{l}, \\mathbf{b}^{l}, g^{l}\\right) \\text{ if } l=1 $$\r\n$$ \\mathbf{Y}^{l} = \\mathcal{F}\\left(\\mathcal{H}\\left(\\mathbf{X}, \\mathbf{Y}^{l-1}\\right), \\mathbf{W}^{l}, \\mathbf{b}^{l}, g^{l}\\right) \\text{ otherwise } $$\r\n\r\nwhere the number of groups at each layer $l$ is computed as:\r\n\r\n$$ g^{l} = \\text{min}\\left(2^{l-1}, g\\_{max}\\right), 1 \\leq l \\leq \\text{ceil}\\left(N/2\\right) $$\r\n$$ g^{l} = g^{N-l}, \\text{ otherwise} $$\r\n\r\nIn the above equations, $\\mathcal{F}$ is a group linear transformation function. The function $\\mathcal{F}$ takes the input $\\left(\\mathbf{X} \\text{ or } \\mathcal{H}\\left(\\mathbf{X}, \\mathbf{Y}^{l-1}\\right) \\right)$, splits it into $g^{l}$ groups, and then applies a linear transformation with learnable parameters $\\mathbf{W}^{l}$ and bias $\\mathbf{b}^{l}$ to each group independently. The outputs of each group are then concatenated to produce the final output $\\mathbf{Y}^{l}$. The function $\\mathcal{H}$ first shuffles the output of each group in $\\mathbf{Y}^{l-1}$ and then combines it with the input $\\mathbf{X}$ using an input mixer connection.\r\n\r\nIn the authors' experiments, they use $g\\_{max} = \\text{ceil}\\left(\\frac{d\\_{m}}{32}\\right)$ so that each group has at least 32 input elements. Note that (i) group linear transformations reduce to linear transformations when $g^{l} = 1$, and (ii) DExTra is equivalent to a multi-layer perceptron when $g\\_{max} = 1$.",
  "title": "DeLighT: Deep and Light-weight Transformer",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "Cascade Corner Pooling",
  "full_name": "Cascade Corner Pooling",
  "description": "**Cascade Corner Pooling** is a pooling layer for object detection that builds upon the [corner pooling](https://paperswithcode.com/method/corner-pooling) operation. Corners are often outside the objects, which lacks local appearance features. [CornerNet](https://paperswithcode.com/method/cornernet) uses corner pooling to address this issue, where we find the maximum values on the boundary directions so as to determine corners. However, it makes corners sensitive to the edges. To address this problem, we need to let corners see the visual patterns of objects. Cascade corner pooling first looks along a boundary to find a boundary maximum value, then looks inside along the location of the boundary maximum value to find an internal maximum value, and finally, add the two maximum values together. By doing this, the corners obtain both the the boundary information and the visual patterns of objects.",
  "title": "CenterNet: Keypoint Triplets for Object Detection",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "LightAutoML",
  "full_name": "LightAutoML",
  "description": "**LightAutoML** is an AutoML solution targeted for financial services companies. A typical LightAutoML pipeline scheme is presented in the Figure, each pipeline containing:\r\n\r\n- Reader: object that receives raw data and task as input, calculates some useful metadata, performs initial data cleaning and decides about data manipulations that should be done before fitting different model types.\r\n\r\n- LightAutoML inner datasets that contains metadata and CV iterators that implements validation scheme for the datasets.\r\n\r\n- Multiple ML Pipelines that are stacked and/or blended to get a single prediction.\r\n\r\nAn ML pipeline in LightAutoML is one or multiple ML models that share a single data preprocessing and validation scheme. The preprocessing step may have up to two feature selection steps, a feature engineering step or even just be empty if no preprocessing is needed. The ML pipelines can be computed independently on the same datasets and then blended together using averaging (or weighted averaging). Alternatively, a stacking ensemble scheme can be used to build multi level ensemble architectures.",
  "title": "LightAutoML: AutoML Solution for a Large Financial Services Ecosystem",
  "collection": "AutoML",
  "area": "General"
}
{
  "name": "TFGW",
  "full_name": "Template based Graph Neural Network with Optimal Transport Distances",
  "description": "",
  "title": null,
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "YOLOv4",
  "full_name": "YOLOv4",
  "description": "**YOLOv4** is a one-stage object detection model that improves on [YOLOv3](https://paperswithcode.com/method/yolov3) with several bags of tricks and modules introduced in the literature. The components section below details the tricks and modules used.",
  "title": "YOLOv4: Optimal Speed and Accuracy of Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "ALCN",
  "full_name": "Adaptive Locally Connected Neuron",
  "description": "The **Adaptive Locally Connected Neuron (ALCN)** is a topology aware, and locally adaptive -focusing neuron:\r\n\r\n$$a =  f\\:\\Bigg( \\sum_{i=1}^{m} w_{i}\\phi\\left( \\tau\\left(i\\right),\\Theta\\right) x_{i} + b \\Bigg) %f\\:\\Bigg(\\mathbf{X(W \\circ \\Phi) + b} \\Bigg) $$",
  "title": "An Adaptive Locally Connected Neuron Model: Focusing Neuron",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "Depthwise Dilated Separable Convolution",
  "full_name": "Depthwise Dilated Separable Convolution",
  "description": "A **Depthwise Dilated Separable Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) that combines [depthwise separability](https://paperswithcode.com/method/depthwise-separable-convolution) with the use of [dilated convolutions](https://paperswithcode.com/method/dilated-convolution).",
  "title": "ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Demon ADAM",
  "full_name": "Demon ADAM",
  "description": "**Demon Adam** is a stochastic optimizer where the [Demon](https://paperswithcode.com/method/demon) momentum rule is applied to the [Adam](https://paperswithcode.com/method/adam) optimizer.\r\n\r\n$$ \\beta\\_{t} = \\beta\\_{init}\\cdot\\frac{\\left(1-\\frac{t}{T}\\right)}{\\left(1-\\beta\\_{init}\\right) + \\beta\\_{init}\\left(1-\\frac{t}{T}\\right)} $$\r\n\r\n$$ m\\_{t, i} = g\\_{t, i} + \\beta\\_{t}m\\_{t-1, i} $$\r\n\r\n$$ v\\_{t+1} = \\beta\\_{2}v\\_{t}  + \\left(1-\\beta\\_{2}\\right)g^{2}\\_{t} $$\r\n\r\n$$ \\theta_{t} = \\theta_{t-1} - \\eta\\frac{\\hat{m}\\_{t}}{\\sqrt{\\hat{v}\\_{t}} + \\epsilon}  $$",
  "title": "Demon: Improved Neural Network Training with Momentum Decay",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Margin ReLU",
  "full_name": "Margin Rectified Linear Unit",
  "description": "**Margin Rectified Linear Unit**, or **Margin ReLU**, is a type of activation function based on a [ReLU](https://paperswithcode.com/method/relu), but it has a negative threshold for negative values instead of a zero threshhold.",
  "title": "A Comprehensive Overhaul of Feature Distillation",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Inception-C",
  "full_name": "Inception-C",
  "description": "**Inception-C** is an image model block used in the [Inception-v4](https://paperswithcode.com/method/inception-v4) architecture.",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Squeeze-and-Excitation Block",
  "full_name": "Squeeze-and-Excitation Block",
  "description": "The **Squeeze-and-Excitation Block** is an architectural unit designed to improve the representational power of a network by enabling it to perform dynamic channel-wise feature recalibration. The process is:\r\n\r\n- The block has a convolutional block as an input.\r\n- Each channel is \"squeezed\" into a single numeric value using [average pooling](https://paperswithcode.com/method/average-pooling).\r\n- A dense layer followed by a [ReLU](https://paperswithcode.com/method/relu) adds non-linearity and output channel complexity is reduced by a ratio.\r\n- Another dense layer followed by a sigmoid gives each channel a smooth gating function.\r\n- Finally, we weight each feature map of the convolutional block based on the side network; the \"excitation\".",
  "title": "Squeeze-and-Excitation Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "PolarMask",
  "full_name": "PolarMask",
  "description": "**PolarMask** is an anchor-box free and single-shot instance segmentation method. Specifically, PolarMask takes an image as input and predicts the distance from a sampled positive location (ie a candidate object's center) with respect to the object's contour at each angle, and then assembles the predicted points to produce the final mask. There are several benefits to the system: (1) The polar representation unifies instance segmentation (masks) and object detection (bounding boxes) into a single framework (2) Two modules are designed (i.e. soft polar centerness and polar IoU loss) to sample high-quality center examples and optimize polar contour regression, making the performance of PolarMask does not depend on the bounding box prediction results and more efficient in training. (3) PolarMask is fully convolutional and can be embedded into most off-the-shelf detection methods.",
  "title": "PolarMask++: Enhanced Polar Representation for Single-Shot Instance Segmentation and Beyond",
  "collection": "Instance Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "GAN Feature Matching",
  "full_name": "GAN Feature Matching",
  "description": "**Feature Matching** is a regularizing objective for a generator in [generative adversarial networks](https://paperswithcode.com/methods/category/generative-adversarial-networks) that prevents it from overtraining on the current discriminator. Instead of directly maximizing the output of the discriminator, the new objective requires the generator to generate data that matches the statistics of the real data, where we use the discriminator only to specify the statistics that we think are worth matching. Specifically, we train the generator to match the expected value of the features on an intermediate layer of the discriminator. This is a natural choice of statistics for the generator to match, since by training the discriminator we ask it to find those features that are most discriminative of real data versus data generated by the current model.\r\n\r\nLetting $\\mathbf{f}\\left(\\mathbf{x}\\right)$ denote activations on an intermediate layer of the discriminator, our new objective for the generator is defined as: $ ||\\mathbb{E}\\_{x\\sim p\\_{data} } \\mathbf{f}\\left(\\mathbf{x}\\right) − \\mathbb{E}\\_{\\mathbf{z}∼p\\_{\\mathbf{z}}\\left(\\mathbf{z}\\right)}\\mathbf{f}\\left(G\\left(\\mathbf{z}\\right)\\right)||^{2}\\_{2} $. The discriminator, and hence\r\n$\\mathbf{f}\\left(\\mathbf{x}\\right)$, are trained as with vanilla GANs. As with regular [GAN](https://paperswithcode.com/method/gan) training, the objective has a fixed point where G exactly matches the distribution of training data.",
  "title": "Improved Techniques for Training GANs",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "QuantTree",
  "full_name": "QuantTree histograms",
  "description": "Given a training set drawn from an unknown $d$-variate probability distribution, QuantTree constructs a histogram by recursively splitting $\\mathbb{R}^d$. The splits are defined by a stochastic process so that each bin contains a certain proportion of the training set. These histograms can be used to define test statistics (e.g., the Pearson statistic) to tell whether a batch of data is drawn from $\\phi_0$ or not. The most crucial property of QuantTree is that the distribution of any statistic based on QuantTree histograms is independent of $\\phi_0$, thus enabling nonparametric statistical testing.",
  "title": "QuantTree: Histograms for Change Detection in Multivariate Data Streams",
  "collection": "Distribution Approximation",
  "area": "General"
}
{
  "name": "GPT-4",
  "full_name": "GPT-4",
  "description": "**GPT-4** is a transformer based model pre-trained to predict the next token in a document.",
  "title": "GPT-4 Technical Report",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "MADDPG",
  "full_name": "MADDPG",
  "description": "**MADDPG**, or **Multi-agent DDPG**, extends [DDPG](https://paperswithcode.com/method/ddpg) into a multi-agent policy gradient algorithm where decentralized agents learn a centralized critic based on the observations and actions of all agents. It leads to learned policies that only use local information (i.e. their own observations) at execution time, does not assume a differentiable model of the environment dynamics or any particular structure on the communication method between agents, and is applicable not only to cooperative interaction but to competitive or mixed interaction involving both physical and communicative behavior. The critic is augmented with extra information about the policies of other agents, while the actor only has access to local information. After training is completed, only the local actors are used at execution phase, acting in a decentralized manner.",
  "title": "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "HDCGAN",
  "full_name": "High-resolution Deep Convolutional Generative Adversarial Networks",
  "description": "**HDCGAN**, or **High-resolution Deep Convolutional Generative Adversarial Networks**, is a [DCGAN](https://paperswithcode.com/method/dcgan) based architecture that achieves high-resolution image generation through the proper use of [SELU](https://paperswithcode.com/method/selu) activations. Glasses, a mechanism to arbitrarily improve the final [GAN](https://paperswithcode.com/method/gan) generated results by enlarging the input size by a telescope ζ is also set forth. \r\n\r\nA video showing the training procedure on CelebA-hq can be found [here](https://youtu.be/1XZB87W0SaY).",
  "title": "High-Resolution Deep Convolutional Generative Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "R-CNN",
  "full_name": "R-CNN",
  "description": "**R-CNN**, or **Regions with CNN Features**, is an object detection model that uses high-capacity CNNs to bottom-up region proposals in order to localize and segment objects. It uses [selective search](https://paperswithcode.com/method/selective-search) to identify a number of bounding-box object region candidates (“regions of interest”), and then extracts features from each region independently for classification.",
  "title": "Rich feature hierarchies for accurate object detection and semantic segmentation",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "ADELE",
  "full_name": "Adaptive Early-Learning Correction",
  "description": "Adaptive Early-Learning Correction for Segmentation from Noisy Annotations",
  "title": "Adaptive Early-Learning Correction for Segmentation from Noisy Annotations",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "SERAC",
  "full_name": "Semi-Parametric Editing with a Retrieval-Augmented Counterfac- tual Model",
  "description": "",
  "title": "Memory-Based Model Editing at Scale",
  "collection": "Information Retrieval Methods",
  "area": "General"
}
{
  "name": "CTracker",
  "full_name": "Chained-Tracker",
  "description": "**Chained-Tracker**, or **CTracker**,  is an online model for multiple-object tracking. It chains paired bounding boxes regression results estimated from overlapping nodes, of which each node covers two adjacent frames. The paired regression is made attentive by object-attention (brought by a detection module) and identity-attention (ensured by an ID verification module).\r\n\r\nThe joint attention module guides the paired boxes regression branch to focus on informative spatial regions with two other branches. One is the object classification branch, which predicts the confidence scores for the first box in the detected box pairs, and such scores are used to guide the regression branch to focus on the foreground regions. The other one is the ID verification branch whose prediction facilitates the regression branch to focus on regions corresponding to the same target. Finally, the bounding box pairs are filtered according to the classification confidence. Then, the generated box pairs belonging to the adjacent frame pairs could be associated using simple methods like IoU (Intersection over Union) matching according to their boxes in the common frame. In this way, the tracking process could be achieved by chaining all the adjacent frame pairs (i.e. chain nodes) sequentially.",
  "title": "Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking",
  "collection": "Multi-Object Tracking Models",
  "area": "Computer Vision"
}
{
  "name": "Gated Convolution Network",
  "full_name": "Gated Convolution Network",
  "description": "A **Gated Convolutional Network** is a type of language model that combines convolutional networks with a gating mechanism. Zero padding is used to ensure future context can not be seen. Gated convolutional layers can be stacked on top of other hierarchically. Model predictions are then obtained with an [adaptive softmax](https://paperswithcode.com/method/adaptive-softmax) layer.",
  "title": "Language Modeling with Gated Convolutional Networks",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "BigGAN-deep",
  "full_name": "BigGAN-deep",
  "description": "**BigGAN-deep** is a deeper version (4x) of [BigGAN](https://paperswithcode.com/method/biggan).  The main difference is a slightly differently designed [residual block](https://paperswithcode.com/method/residual-block). Here the $z$ vector is concatenated with the conditional vector without splitting it into chunks.  It is also based on residual blocks with bottlenecks. BigGAN-deep uses a different strategy than BigGAN aimed at preserving identity throughout the skip connections. In G, where the number of channels needs to be reduced, BigGAN-deep simply retains the first group of channels and drop the rest to produce the required number of channels. In D, where the number of channels should be increased, BigGAN-deep passes the input channels unperturbed, and concatenates them with the remaining channels produced by a 1 × 1 [convolution](https://paperswithcode.com/method/convolution). As far as the\r\nnetwork configuration is concerned, the discriminator is an exact reflection of the generator. \r\n\r\nThere are two blocks at each resolution (BigGAN uses one), and as a result BigGAN-deep is four times\r\ndeeper than BigGAN. Despite their increased depth, the BigGAN-deep models have significantly\r\nfewer parameters mainly due to the bottleneck structure of their residual blocks.",
  "title": "Large Scale GAN Training for High Fidelity Natural Image Synthesis",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "SETSe",
  "full_name": "Strain Elevation Tension Spring embedding",
  "description": "SETSe is a deterministic physics based graph embedding algorithm. It embeds weighted feature rich networks. It treats each edge as a spring and each node as a bead whose movement is constrained by the graph adjacency matrix so that the nodes move in parallel planes enforcing a minimum distance between neighboring nodes. The node features act as forces moving the nodes up and down. The network converges to the embedded state when the force produced by each node is equal and opposite to the sum of the forces exerted by its edges, creating a net force of 0.\r\n\r\nSETSe has no conventional loss function and does not attempt to place similar nodes close to each other.",
  "title": null,
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Normalizing Flows",
  "full_name": "Normalizing Flows",
  "description": "**Normalizing Flows** are a method for constructing complex distributions by transforming a\r\nprobability density through a series of invertible mappings. By repeatedly applying the rule for change of variables, the initial density ‘flows’ through the sequence of invertible mappings. At the end of this sequence we obtain a valid probability distribution and hence this type of flow is referred to as a normalizing flow.\r\n\r\nIn the case of finite flows, the basic rule for the transformation of densities considers an invertible, smooth mapping $f : \\mathbb{R}^{d} \\rightarrow \\mathbb{R}^{d}$ with inverse $f^{-1} = g$, i.e. the composition $g \\cdot f\\left(z\\right) = z$. If we use this mapping to transform a random variable $z$ with distribution $q\\left(z\\right)$, the resulting random variable $z' = f\\left(z\\right)$ has a distribution:\r\n\r\n$$ q\\left(\\mathbf{z}'\\right) = q\\left(\\mathbf{z}\\right)\\bigl\\vert{\\text{det}}\\frac{\\delta{f}^{-1}}{\\delta{\\mathbf{z'}}}\\bigr\\vert = q\\left(\\mathbf{z}\\right)\\bigl\\vert{\\text{det}}\\frac{\\delta{f}}{\\delta{\\mathbf{z}}}\\bigr\\vert ^{-1} $$\r\n\f\r\nwhere the last equality can be seen by applying the chain rule (inverse function theorem) and is a property of Jacobians of invertible functions. We can construct arbitrarily complex densities by composing several simple maps and successively applying the above equation. 
The density $q\\_{K}\\left(\\mathbf{z}\\right)$ obtained by successively transforming a random variable $z\\_{0}$ with distribution $q\\_{0}$ through a chain of $K$ transformations $f\\_{k}$ is:\r\n\r\n$$ z\\_{K} = f\\_{K} \\cdot \\dots \\cdot f\\_{2} \\cdot f\\_{1}\\left(z\\_{0}\\right) $$\r\n\r\n$$ \\ln{q}\\_{K}\\left(z\\_{K}\\right) = \\ln{q}\\_{0}\\left(z\\_{0}\\right) − \\sum^{K}\\_{k=1}\\ln\\vert\\det\\frac{\\delta{f\\_{k}}}{\\delta{\\mathbf{z\\_{k-1}}}}\\vert $$\r\n\f\r\nThe path traversed by the random variables $z\\_{k} = f\\_{k}\\left(z\\_{k-1}\\right)$ with initial distribution $q\\_{0}\\left(z\\_{0}\\right)$ is called the flow and the path formed by the successive distributions $q\\_{k}$ is a normalizing flow.",
  "title": "Variational Inference with Normalizing Flows",
  "collection": "Distribution Approximation",
  "area": "General"
}
{
  "name": "Forward gradient",
  "full_name": "Forward gradient",
  "description": "Forward gradients are unbiased estimators of the gradient $\\nabla f(\\theta)$ for a function $f: \\mathbb{R}^n \\rightarrow \\mathbb{R}$, given by $g(\\theta) = \\langle \\nabla f(\\theta) , v \\rangle v$. \r\n\r\nHere $v = (v_1, \\ldots, v_n)$ is a random vector, which must satisfy the following conditions in order for $g(\\theta)$ to be an unbiased estimator of $\\nabla f(\\theta)$\r\n\r\n* $v_i \\perp v_j$ for all $i \\neq j$\r\n* $\\mathbb{E}[v_i] = 0$ for all $i$\r\n* $\\mathbb{V}[v_i] = 1$ for all $i$\r\n\r\nForward gradients can be computed with a single jvp (Jacobian Vector Product), which enables the use of the forward mode of autodifferentiation instead of the usual reverse mode, which has worse computational characteristics.",
  "title": "Gradients without Backpropagation",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "GAN Hinge Loss",
  "full_name": "GAN Hinge Loss",
  "description": "The **GAN Hinge Loss** is a hinge loss based loss function for [generative adversarial networks](https://paperswithcode.com/methods/category/generative-adversarial-networks):\r\n\r\n$$ L\\_{D} = -\\mathbb{E}\\_{\\left(x, y\\right)\\sim{p}\\_{data}}\\left[\\min\\left(0, -1 + D\\left(x, y\\right)\\right)\\right] -\\mathbb{E}\\_{z\\sim{p\\_{z}}, y\\sim{p\\_{data}}}\\left[\\min\\left(0, -1 - D\\left(G\\left(z\\right), y\\right)\\right)\\right] $$\r\n\r\n$$ L\\_{G} = -\\mathbb{E}\\_{z\\sim{p\\_{z}}, y\\sim{p\\_{data}}}D\\left(G\\left(z\\right), y\\right) $$",
  "title": "Geometric GAN",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "CoOp",
  "full_name": "Context Optimization",
  "description": "**CoOp**, or **Context Optimization**, is an automated prompt engineering method that avoids manual prompt tuning by modeling context words with continuous vectors that are end-to-end learned from data. The context could be shared among all classes or designed to be class-specific. During training, we simply minimize the prediction error using the cross-entropy loss with respect to the learnable context vectors, while keeping the pre-trained parameters fixed. The gradients can be back-propagated all the way through the text encoder, distilling the rich knowledge encoded in the parameters for learning task-relevant context.",
  "title": "Learning to Prompt for Vision-Language Models",
  "collection": "Prompt Engineering",
  "area": "General"
}
{
  "name": "ADA",
  "full_name": "Adaptive Discriminator Augmentation",
  "description": "",
  "title": "Training Generative Adversarial Networks with Limited Data",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "State-Aware Tracker",
  "full_name": "State-Aware Tracker",
  "description": "**State-Aware Tracker** is a pipeline for semi-supervised video object segmentation. It takes each target object as a tracklet, which not only makes the pipeline more efficient but also filters distractors to facilitate target modeling. For more stable and robust performance over video sequences, SAT gets awareness for each state and makes self-adaptation via two feedback loops. One loop assists SAT in generating more stable tracklets. The other loop helps to construct a more robust and holistic target representation.",
  "title": "State-Aware Tracker for Real-Time Video Object Segmentation",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "K-Net",
  "full_name": "K-Net",
  "description": "**K-Net** is a framework for unified semantic and instance segmentation that segments both instances and semantic categories consistently by a group of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. It begins with a set of kernels that are randomly initialized, and learns the kernels in accordance to the segmentation targets at hand, namely, semantic kernels for semantic categories and instance kernels for instance identities. A simple combination of semantic kernels and instance kernels allows panoptic segmentation naturally. In the forward pass, the kernels perform [convolution](https://paperswithcode.com/method/convolution) on the image features to obtain the corresponding segmentation predictions.\r\n\r\nK-Net is formulated so that it dynamically updates the kernels to make them conditional to their activations on the image. Such a content-aware mechanism is crucial to ensure that each kernel, especially an instance kernel, responds accurately to varying objects in an image. Through applying this adaptive kernel update strategy iteratively, K-Net significantly improves the discriminative ability of the kernels and boosts the final segmentation performance. It is noteworthy that this strategy universally applies to kernels for all the segmentation tasks.\r\n\r\nIt also utilises a bipartite matching strategy to assign learning targets for each kernel. This training approach is advantageous to conventional training strategies as it builds a one-to-one mapping between kernels and instances in an image. It thus resolves the problem of dealing with a varying number of instances in an image. In addition, it is purely mask-driven without involving boxes. Hence, K-Net is naturally [NMS](https://paperswithcode.com/method/non-maximum-suppression)-free and box-free, which is appealing to real-time applications.",
  "title": "K-Net: Towards Unified Image Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "High-level backbone",
  "full_name": "High-level backbone",
  "description": "",
  "title": "EfficientPose: Scalable single-person pose estimation",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Denoising Score Matching",
  "full_name": "Denoising Score Matching",
  "description": "Training a denoiser on signals gives you a powerful prior over this signal that you can then use to sample examples of this signal.",
  "title": "Generative Modeling by Estimating Gradients of the Data Distribution",
  "collection": "Generative Training",
  "area": "Computer Vision"
}
{
  "name": "DropBlock",
  "full_name": "DropBlock",
  "description": "**DropBlock** is a structured form of [dropout](https://paperswithcode.com/method/dropout) directed at regularizing convolutional networks. In DropBlock, units in a contiguous region of a feature map are dropped together.  As DropBlock discards features in a correlated area, the networks must look elsewhere for evidence to fit the data.",
  "title": "DropBlock: A regularization method for convolutional networks",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Projection Discriminator",
  "full_name": "Projection Discriminator",
  "description": "A **Projection Discriminator** is a type of discriminator for generative adversarial networks. It is motivated by a probabilistic model in which the distribution of the conditional variable $\\textbf{y}$ given $\\textbf{x}$ is discrete or uni-modal continuous distributions.\r\n\r\nIf we look at the original solution for the loss function $\\mathcal{L}\\_{D}$ in the vanilla GANs, we can decompose it into the sum of two log-likelihood ratios:\r\n\r\n$$ f^{*}\\left(\\mathbf{x}, \\mathbf{y}\\right) = \\log\\frac{q\\left(\\mathbf{x}\\mid{\\mathbf{y}}\\right)q\\left(\\mathbf{y}\\right)}{p\\left(\\mathbf{x}\\mid{\\mathbf{y}}\\right)p\\left(\\mathbf{y}\\right)} = \\log\\frac{q\\left(\\mathbf{y}\\mid{\\mathbf{x}}\\right)}{p\\left(\\mathbf{y}\\mid{\\mathbf{x}}\\right)} + \\log\\frac{q\\left(\\mathbf{x}\\right)}{p\\left(\\mathbf{x}\\right)}  = r\\left(\\mathbf{y\\mid{x}}\\right) + r\\left(\\mathbf{x}\\right) $$\r\n\r\nWe can model the log likelihood ratio $r\\left(\\mathbf{y\\mid{x}}\\right)$ and  $r\\left(\\mathbf{x}\\right)$ by some parametric functions $f\\_{1}$ and $f\\_{2}$ respectively. If we make a standing assumption that $p\\left(y\\mid{x}\\right)$ and $q\\left(y\\mid{x}\\right)$ are simple distributions like those that are Gaussian or discrete log linear on the feature space, then the parametrization of the following form becomes natural:\r\n\r\n$$ f\\left(\\mathbf{x}, \\mathbf{y}; \\theta\\right) = f\\_{1}\\left(\\mathbf{x}, \\mathbf{y}; \\theta\\right) + f\\_{2}\\left(\\mathbf{x}; \\theta\\right) = \\mathbf{y}^{T}V\\phi\\left(\\mathbf{x}; \\theta\\_{\\phi}\\right) + \\psi\\left(\\phi(\\mathbf{x}; \\theta\\_{\\phi}); \\theta\\_{\\psi}\\right) $$\r\n\r\nwhere $V$ is the embedding matrix of $y$, $\\phi\\left(·, \\theta\\_{\\phi}\\right)$ is a vector output function of $x$, and $\\psi\\left(·, \\theta\\_{\\psi}\\right)$ is a scalar function of the same $\\phi\\left(\\mathbf{x}; \\theta\\_{\\phi}\\right)$ that appears in $f\\_{1}$. 
The learned parameters $\\theta = ${$V, \\theta\\_{\\phi}, \\theta\\_{\\psi}$} are trained to optimize the adversarial loss. This model of the discriminator is the projection.",
  "title": "cGANs with Projection Discriminator",
  "collection": "Discriminators",
  "area": "General"
}
{
  "name": "CPN",
  "full_name": "Contour Proposal Network",
  "description": "The Contour Proposal Network (CPN) detects possibly overlapping objects in an image while simultaneously fitting pixel-precise closed object contours. The CPN can incorporate state of the art object detection architectures as backbone networks into a fast single-stage instance segmentation model that can be trained end-to-end.",
  "title": "Contour Proposal Networks for Biomedical Instance Segmentation",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "LLaMA",
  "full_name": "LLaMA",
  "description": "**LLaMA** is a collection of foundation language models ranging from 7B to 65B parameters. It is based on the transformer architecture with various improvements that were subsequently proposed. The main difference with the original architecture are listed below.\r\n\r\n- RMSNorm normalizing function is used to improve the training stability, by normalizing the input of each transformer sub-layer, instead of normalizing the output.\r\n- The ReLU non-linearity is replaced by the SwiGLU activation function to improve performance.\r\n- Absolute positional embeddings are removed and instead rotary positional embeddings (RoPE) are added at each layer of the network.",
  "title": "LLaMA: Open and Efficient Foundation Language Models",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Spatial Group-wise Enhance",
  "full_name": "Spatial Group-wise Enhance",
  "description": "**Spatial Group-wise Enhance** is a module for convolutional neural networks that can adjust the\r\nimportance of each sub-feature by generating an attention factor for each spatial location in each semantic group, so that every individual group can autonomously enhance its learnt expression and suppress possible noise\r\n\r\nInside each feature group, we model a spatial enhance mechanism inside each feature group, by scaling the feature vectors over all the locations with an attention mask. This attention mask is designed to suppress the possible noise and highlight the correct semantic feature regions. Different from other popular attention methods, it utilises the similarity between the global statistical feature and the local ones of each location as the source of generation for the attention masks.",
  "title": "Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Collaborative Distillation",
  "full_name": "Collaborative Distillation",
  "description": "**Collaborative Distillation** is a new knowledge distillation method (named Collaborative Distillation) for encoder-decoder based neural style transfer to reduce the number of convolutional filters. The main idea is underpinned by a finding that the encoder-decoder pairs construct an exclusive collaborative relationship, which is regarded as a new kind of knowledge for style transfer models.",
  "title": "Collaborative Distillation for Ultra-Resolution Universal Style Transfer",
  "collection": "Knowledge Distillation",
  "area": "General"
}
{
  "name": "PixelCNN",
  "full_name": "PixelCNN",
  "description": "A **PixelCNN** is a generative model that uses autoregressive connections to model images pixel by pixel, decomposing the joint image distribution as a product of conditionals. PixelCNNs are much faster to train than [PixelRNNs](https://paperswithcode.com/method/pixelrnn) because convolutions are inherently easier to parallelize; given the vast number of pixels present in large image datasets this is an important advantage.",
  "title": "Pixel Recurrent Neural Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "SNIP",
  "full_name": "SNIP",
  "description": "**SNIP**, or **Scale Normalization for Image Pyramids**, is a multi-scale training scheme that selectively back-propagates the gradients of object instances of different sizes as a function of the image scale. SNIP is a modified version of MST where only the object instances that have a resolution close to the pre-training dataset, which is typically 224x224, are used for training the detector. In multi-scale training (MST), each image is observed at different resolutions therefore, at a high resolution (like 1400x2000) large objects are hard to classify and at a low resolution (like 480x800) small objects are hard to classify. Fortunately, each object instance appears at several different scales and some of those appearances fall in the desired scale range. In order to eliminate extreme scale objects, either too large or too small, training is only performed on objects that fall in the desired scale range and the remainder are simply ignored during back-propagation. Effectively, SNIP uses all the object instances during training, which helps capture all the variations in appearance and\r\npose, while reducing the domain-shift in the scale-space for the pre-trained network.",
  "title": "An Analysis of Scale Invariance in Object Detection - SNIP",
  "collection": "Multi-Scale Training",
  "area": "Computer Vision"
}
{
  "name": "TAPEX",
  "full_name": "Table Pre-training via Execution",
  "description": "TAPEX is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesising executable SQL queries.",
  "title": "TAPEX: Table Pre-training via Learning a Neural SQL Executor",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "GFSA",
  "full_name": "Graph Finite-State Automaton",
  "description": "**Graph Finite-State Automaton**, or **GFSA**, is a differentiable layer for learning graph structure that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. This layer can be trained end-to-end to add derived relationships (edges) to arbitrary graph-structured data based on performance on a downstream task.",
  "title": "Learning Graph Structure With A Finite-State Automaton Layer",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "CARAFE",
  "full_name": "CARAFE",
  "description": "**Content-Aware ReAssembly of FEatures (CARAFE)** is an operator for feature upsampling in convolutional neural networks. CARAFE has several appealing properties: (1) Large field of view. Unlike previous works (e.g. bilinear interpolation) that only exploit subpixel neighborhood, CARAFE can aggregate contextual information within a large receptive field. (2) Content-aware handling. Instead of using a fixed kernel for all samples (e.g. deconvolution), CARAFE enables instance-specific content-aware handling, which generates adaptive kernels on-the-fly. (3) Lightweight and fast to compute.",
  "title": "CARAFE: Content-Aware ReAssembly of FEatures",
  "collection": "Feature Upsampling",
  "area": "Computer Vision"
}
{
  "name": "Routing Attention",
  "full_name": "Routing Attention",
  "description": "**Routed Attention** is an attention pattern proposed as part of the [Routing Transformer](https://paperswithcode.com/method/routing-transformer) architecture.  Each attention module\r\nconsiders a clustering of the space: the current timestep only attends to context belonging to the same cluster. In other word, the current time-step query is routed to a limited number of context through its cluster assignment. This can be contrasted with [strided](https://paperswithcode.com/method/strided-attention) attention patterns and those proposed with the [Sparse Transformer](https://paperswithcode.com/method/sparse-transformer).\r\n\r\nIn the image to the right, the rows represent the outputs while the columns represent the inputs. The different colors represent cluster memberships for the output token.",
  "title": "Efficient Content-Based Sparse Attention with Routing Transformers",
  "collection": "Attention Patterns",
  "area": "Natural Language Processing"
}
{
  "name": "AlphaZero",
  "full_name": "AlphaZero",
  "description": "**AlphaZero** is a reinforcement learning agent for playing board games such as Go, chess, and shogi. ",
  "title": "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm",
  "collection": "Board Game Models",
  "area": "Reinforcement Learning"
}
{
  "name": "EGT",
  "full_name": "Edge-augmented Graph Transformer",
  "description": "Transformer neural networks have achieved state-of-the-art results for unstructured data such as text and images but their adoption for graph-structured data has been limited. This is partly due to the difficulty of incorporating complex structural information in the basic transformer framework. We propose a simple yet powerful extension to the transformer - residual edge channels. The resultant framework, which we call Edge-augmented Graph Transformer (EGT), can directly accept, process and output structural information as well as node information. It allows us to use global self-attention, the key element of transformers, directly for graphs and comes with the benefit of long-range interaction among nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. In addition, we introduce a generalized positional encoding scheme for graphs based on Singular Value Decomposition which can improve the performance of EGT. Our framework, which relies on global node feature aggregation, achieves better performance compared to Convolutional/Message-Passing Graph Neural Networks, which rely on local feature aggregation within a neighborhood. We verify the performance of EGT in a supervised learning setting on a wide range of experiments on benchmark datasets. Our findings indicate that convolutional aggregation is not an essential inductive bias for graphs and global self-attention can serve as a flexible and adaptive alternative.",
  "title": "Global Self-Attention as a Replacement for Graph Convolution",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Charformer",
  "full_name": "Charformer",
  "description": "**Charformer** is a type of [Transformer](https://paperswithcode.com/methods/category/transformers) model that learns a subword tokenization end-to-end as part of the model. Specifically it uses [GBST](https://paperswithcode.com/method/gradient-based-subword-tokenization) that automatically learns latent subword representations from characters in a data-driven fashion. Following GBST, the soft subword sequence is passed through [Transformer](https://paperswithcode.com/method/transformer) layers.",
  "title": "Charformer: Fast Character Transformers via Gradient-based Subword Tokenization",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "LinComb",
  "full_name": "Linear Combination of Activations",
  "description": "The **Linear Combination of Activations**, or **LinComb**, is a type of activation function that has trainable parameters and uses the linear combination of other activation functions.\r\n\r\n$$LinComb(x) = \\sum\\limits_{i=0}^{n} w_i \\mathcal{F}_i(x)$$",
  "title": "Trainable Activations for Image Classification",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "BinaryBERT",
  "full_name": "BinaryBERT",
  "description": "**BinaryBERT** is a [BERT](https://paperswithcode.com/method/bert)-variant that applies quantization in the form of weight binarization. Specifically, ternary weight splitting is proposed which initializes BinaryBERT by equivalently splitting from a half-sized ternary network. To obtain BinaryBERT, we first train a half-sized [ternary BERT](https://paperswithcode.com/method/ternarybert) model, and then apply a [ternary weight splitting](https://paperswithcode.com/method/ternary-weight-splitting) operator to obtain the latent full-precision and quantized weights as the initialization of the full-sized BinaryBERT. We then fine-tune BinaryBERT for further refinement.",
  "title": "BinaryBERT: Pushing the Limit of BERT Quantization",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Lower Bound on Transmission using Non-Linear Bounding Function in Single Image Dehazing",
  "full_name": "Lower Bound on Transmission using Non-Linear Bounding Function in Single Image Dehazing",
  "description": "",
  "title": "Lower Bound on Transmission Using Non-Linear Bounding Function in Single Image Dehazing",
  "collection": "Image Denoising Models",
  "area": "Computer Vision"
}
{
  "name": "Reversible Residual Block",
  "full_name": "Reversible Residual Block",
  "description": "**Reversible Residual Blocks** are skip-connection blocks that learn *reversible* residual functions with reference to the layer inputs. It is proposed as part of the [RevNet](https://paperswithcode.com/method/revnet) CNN architecture. Units in each layer are partitioned into two groups, denoted $x\\_{1}$ and $x\\_{2}$; the authors find what works best is partitioning the channels. Each reversible block takes inputs $\\left(x\\_{1}, x\\_{2}\\right)$ and produces outputs $\\left(y\\_{1}, y\\_{2}\\right)$ according to the following additive coupling rules – inspired by the transformation in [NICE](https://paperswithcode.com/method/nice) (nonlinear independent components estimation) – and residual functions $F$ and $G$ analogous to those in standard [ResNets](https://paperswithcode.com/method/resnet):\r\n\r\n$$y\\_{1} = x\\_{1} + F\\left(x\\_{2}\\right)$$\r\n$$y\\_{2} = x\\_{2} + G\\left(y\\_{1}\\right)$$\r\n\r\nEach layer’s activations can be reconstructed from the next layer’s activations as follows:\r\n\r\n$$ x\\_{2} = y\\_{2} − G\\left(y\\_{1}\\right)$$\r\n$$ x\\_{1} = y\\_{1} − F\\left(x\\_{2}\\right)$$",
  "title": "The Reversible Residual Network: Backpropagation Without Storing Activations",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "Pose Disentangling",
  "full_name": "Pose-Appearance Disentangling",
  "description": "A method to disentangle pose from other factors in a scene.",
  "title": "Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment",
  "collection": "Pose Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "IPA-GNN",
  "full_name": "Instruction Pointer Attention Graph Neural Network",
  "description": "**Instruction Pointer Attention Graph Neural Network**, or **IPA-GNN**, is a learning-interpreter neural network (LNN) based on GNNs for learning to execute programmes. It achieves improved systematic generalization on the task of learning to execute programs using control flow graphs. The model arises by considering RNNs operating on program traces with branch decisions as latent variables. The IPA-GNN can be seen either as a continuous relaxation of the RNN model or as a GNN variant more tailored to execution.",
  "title": "Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "AlignPS",
  "full_name": "Feature-Aligned Person Search Network",
  "description": "**AlignPS**, or **Feature-Aligned Person Search Network**, is an anchor-free framework for efficient person search. The model employs the typical architecture of an anchor-free detection model (i.e., [FCOS](https://paperswithcode.com/method/fcos)). An aligned feature aggregation (AFA) module is designed to make the model focus more on the re-id subtask. Specifically, AFA reshapes some building blocks of [FPN](https://paperswithcode.com/method/fpn) to overcome the issues of region and scale misalignment in re-id feature learning. A [deformable convolution](https://paperswithcode.com/method/deformable-convolution) is exploited to make the re-id embeddings adaptively aligned with the foreground regions. A feature fusion scheme is designed to better aggregate features from different FPN levels, which makes the re-id features more robust to scale variations. The training procedures of re-id and detection are also optimized to place more emphasis on generating robust re-id embeddings.",
  "title": "Efficient Person Search: An Anchor-Free Approach",
  "collection": "Person Search Models",
  "area": "Computer Vision"
}
{
  "name": "Linear Warmup With Linear Decay",
  "full_name": "Linear Warmup With Linear Decay",
  "description": "**Linear Warmup With Linear Decay** is a learning rate schedule in which we increase the learning rate linearly for $n$ updates and then linearly decay afterwards.",
  "title": null,
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "DGRF",
  "full_name": "Difference of Gaussian Random Forest",
  "description": "",
  "title": "Morphology Decoder: A Machine Learning Guided 3D Vision Quantifying Heterogenous Rock Permeability for Planetary Surveillance and Robotic Functions",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Separate And Diffuse",
  "full_name": "Separate And Diffuse",
  "description": "",
  "title": "Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation",
  "collection": "Speech Separation Models",
  "area": "Audio"
}
{
  "name": "NeuralRecon",
  "full_name": "NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video",
  "description": "**NeuralRecon** is a framework for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, NeuralRecon proposes to directly reconstruct local surfaces represented as sparse TSDF volumes for each video fragment sequentially by a neural network. A learning-based TSDF fusion module based on gated recurrent units is used to guide the network to fuse features from previous fragments. This design allows the network to capture local smoothness prior and global shape prior of 3D surfaces.",
  "title": "NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video",
  "collection": "3D Reconstruction",
  "area": "Computer Vision"
}
{
  "name": "Entropy Regularization",
  "full_name": "Entropy Regularization",
  "description": "**Entropy Regularization** is a type of regularization used in [reinforcement learning](https://paperswithcode.com/methods/area/reinforcement-learning). For on-policy policy gradient based methods like [A3C](https://paperswithcode.com/method/a3c), the same mutual  reinforcement behaviour leads to a highly-peaked $\\pi\\left(a\\mid{s}\\right)$ towards a few actions or action sequences, since it is easier for the actor and critic to overoptimise to a small portion of the environment. To reduce this problem, entropy regularization adds an entropy term to the loss to promote action diversity:\r\n\r\n$$H(X) = -\\sum\\pi\\left(x\\right)\\log\\left(\\pi\\left(x\\right)\\right) $$\r\n\r\nImage Credit: Wikipedia",
  "title": "Asynchronous Methods for Deep Reinforcement Learning",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Vokenization",
  "full_name": "Vokenization",
  "description": "**Vokenization** is an approach for extrapolating multimodal alignments to language-only data by contextually mapping language tokens to their related images (\"vokens\") by retrieval. Instead of directly supervising the language model with visually grounded language datasets (e.g., MS COCO) these relative small datasets are used to train the vokenization processor (i.e. the vokenizer). Vokens are generated for large language corpora (e.g., English Wikipedia), and the visually-supervised language model takes the\r\ninput supervision from these large datasets, thus bridging the gap between different data sources.",
  "title": "Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision",
  "collection": "Multi-Modal Methods",
  "area": "Computer Vision"
}
{
  "name": "Random Erasing",
  "full_name": "Random Erasing",
  "description": "Random Erasing is a data augmentation method for training the convolutional neural network (CNN), which randomly selects a rectangle region in an image and erases its pixels with random values. In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing is parameter learning free, easy to implement, and can be integrated with most of the CNN-based recognition models. Random Erasing is complementary to commonly used data augmentation techniques such as random cropping and flipping, and can be implemented in various vision tasks, such as image classification, object detection, semantic segmentation.",
  "title": "Random Erasing Data Augmentation",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Style Transfer Module",
  "full_name": "Style Transfer Module",
  "description": "Modules used in [GAN](https://paperswithcode.com/method/gan)'s style transfer.",
  "title": "Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "CEAL",
  "full_name": "Compute-Efficient Active Learning",
  "description": "The motivation of this work is based on the hypothesis that historical values of the acquisition function are good predictors of their future values. This idea is quite intuitive. For example, once a model is certain about its predictions on a given sample, this fact is unlikely to change. This can be explained by the randomness in the training, especially when using small acquisition sizes.",
  "title": null,
  "collection": "Active Learning",
  "area": "General"
}
{
  "name": "Double DQN",
  "full_name": "Double DQN",
  "description": "A **Double Deep Q-Network**, or **Double DQN** utilises [Double Q-learning](https://paperswithcode.com/method/double-q-learning) to reduce overestimation by decomposing the max operation in the target into action selection and action evaluation. We evaluate the greedy policy according to the online network, but we use the target network to estimate its value.  The update is the same as for [DQN](https://paperswithcode.com/method/dqn), but replacing the target $Y^{DQN}\\_{t}$ with:\r\n\r\n$$ Y^{DoubleDQN}\\_{t} = R\\_{t+1}+\\gamma{Q}\\left(S\\_{t+1}, \\arg\\max\\_{a}Q\\left(S\\_{t+1}, a; \\theta\\_{t}\\right);\\theta\\_{t}^{-}\\right) $$\r\n\r\nCompared to the original formulation of Double [Q-Learning](https://paperswithcode.com/method/q-learning), in Double DQN the weights of the second network $\\theta^{'}\\_{t}$ are replaced with the weights of the target network $\\theta\\_{t}^{-}$ for the evaluation of the current greedy policy.",
  "title": "Deep Reinforcement Learning with Double Q-learning",
  "collection": "Q-Learning Networks",
  "area": "Reinforcement Learning"
}
{
  "name": "CANINE",
  "full_name": "CANINE",
  "description": "**CANINE** is a pre-trained encoder for language understanding that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy with soft inductive biases in place of hard token boundaries. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep [transformer](https://paperswithcode.com/method/transformer) stack, which encodes context.",
  "title": "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "GIC",
  "full_name": "Graph InfoClust",
  "description": "",
  "title": "Graph InfoClust: Leveraging cluster-level node information for unsupervised graph representation learning",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "MAE",
  "full_name": "Masked autoencoder",
  "description": "",
  "title": "Masked Autoencoders Are Scalable Vision Learners",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Ontology",
  "full_name": "Ontology",
  "description": "",
  "title": "Ontology-Based Production Simulation with OntologySim",
  "collection": "Knowledge Distillation",
  "area": "General"
}
{
  "name": "ResNet-D",
  "full_name": "ResNet-D",
  "description": "**ResNet-D** is a modification on the [ResNet](https://paperswithcode.com/method/resnet) architecture that utilises an [average pooling](https://paperswithcode.com/method/average-pooling) tweak for downsampling. The motivation is that in the unmodified ResNet, the 1 × 1 [convolution](https://paperswithcode.com/method/convolution) for the downsampling block ignores 3/4 of input feature maps, so this is modified so no information will be ignored",
  "title": "Bag of Tricks for Image Classification with Convolutional Neural Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "AutoTinyBERT",
  "full_name": "AutoTinyBERT",
  "description": "**AutoTinyBERT** is a an efficient [BERT](https://paperswithcode.com/method/bert) variant found through neural architecture search. Specifically, one-shot learning is used to obtain a big Super Pretrained Language Model (SuperPLM), where the objectives of pre-training or task-agnostic BERT distillation are used.  Then, given a specific latency constraint, an evolutionary algorithm is run on the SuperPLM to search optimal architectures. Finally, we extract the corresponding sub-models based on the optimal architectures and further train these models.",
  "title": "AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Memory Network",
  "full_name": "Memory Network",
  "description": "A **Memory Network** provides a memory component that can be read from and written to with the inference capabilities of a neural network model. The motivation is that many neural networks lack a long-term memory component, and their existing memory component encoded by states and weights is too small and not compartmentalized enough to accurately remember facts from the past (RNNs for example, have difficult memorizing and doing tasks like copying). \r\n\r\nA memory network consists of a memory $\\textbf{m}$ (an array of objects indexed by $\\textbf{m}\\_{i}$ and four potentially learned components:\r\n\r\n- Input feature map $I$ - feature representation of the data input.\r\n- Generalization $G$ - updates old memories given the new input.\r\n- Output feature map $O$ - produces new feature map given $I$ and $G$.\r\n- Response $R$ - converts output into the desired response. \r\n\r\nGiven an input $x$ (e.g., an input character, word or sentence depending on the granularity chosen, an image or an audio signal) the flow of the model is as follows:\r\n\r\n1. Convert $x$ to an internal feature representation $I\\left(x\\right)$.\r\n2. Update memories $m\\_{i}$ given the new input: $m\\_{i} = G\\left(m\\_{i}, I\\left(x\\right), m\\right)$, $\\forall{i}$.\r\n3. Compute output features $o$ given the new input and the memory: $o = O\\left(I\\left(x\\right), m\\right)$.\r\n4. Finally, decode output features $o$ to give the final response: $r = R\\left(o\\right)$.\r\n\r\nThis process is applied at both train and test time, if there is a distinction between such phases, that\r\nis, memories are also stored at test time, but the model parameters of $I$, $G$, $O$ and $R$ are not updated. Memory networks cover a wide class of possible implementations. The components $I$, $G$, $O$ and $R$ can potentially use any existing ideas from the machine learning literature.\r\n\r\nImage Source: [Adrian Colyer](https://blog.acolyer.org/2016/03/10/memory-networks/)",
  "title": "Memory Networks",
  "collection": "Working Memory Models",
  "area": "General"
}
{
  "name": "Mixer Layer",
  "full_name": "MLP-Mixer Layer",
  "description": "A Mixer layer is a layer used in the MLP-Mixer architecture proposed by Tolstikhin et. al (2021) for computer vision. Mixer layers consist purely of MLPs, without convolutions or attention. It takes an input of embedded image patches (tokens), with its output having the same shape as its input, similar to that of a Vision Transformer encoder. As suggested by its name, Mixer layers \"mix\" tokens and channels through its \"token mixing\" and \"channel mixing\" MLPs contained the layer. It utilizes previous techniques by other architectures, such as layer normalization, skip-connections, and regularization methods.\r\n\r\nImage credit: Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., ... & Dosovitskiy, A. (2021). Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34, 24261-24272.",
  "title": "MLP-Mixer: An all-MLP Architecture for Vision",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "R-FCN",
  "full_name": "Region-based Fully Convolutional Network",
  "description": "**Region-based Fully Convolutional Networks**, or **R-FCNs**, are a type of region-based object detector. In contrast to previous region-based object detectors such as Fast/[Faster R-CNN](https://paperswithcode.com/method/faster-r-cnn) that apply a costly per-region subnetwork hundreds of times, R-FCN is fully convolutional with almost all computation shared on the entire image.\r\n\r\nTo achieve this, R-FCN utilises position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection.",
  "title": "R-FCN: Object Detection via Region-based Fully Convolutional Networks",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "AltCLIP",
  "full_name": "AltCLIP",
  "description": "In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI.",
  "title": "AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Talking-Heads Attention",
  "full_name": "Talking-Heads Attention",
  "description": "**Talking-Heads Attention** is a variation on [multi-head attention](https://paperswithcode.com/method/multi-head-attention) which includes linear projections across the attention-heads dimension, immediately before and after the [softmax](https://paperswithcode.com/method/softmax) operation. In [multi-head attention](https://paperswithcode.com/method/multi-head-attention), the different attention heads perform separate computations, which are then summed at the end. Talking-Heads Attention breaks that separation. Two additional learned linear projections are inserted, $P\\_{l}$ and $P\\_{w}$, which transform the attention-logits and the attention weights respectively, moving information across attention heads. Instead of one \"heads\" dimension $h$ across the whole computation, we now have three separate heads dimensions: $h\\_{k}$, $h$, and $h\\_{v}$, which can optionally differ in size (number of \"heads\"). $h\\_{k}$ refers to the number of attention heads for the keys and the queries. $h$ refers to the number of attention heads for the logits and the weights, and $h\\_{v}$ refers to the number of attention heads for the values.",
  "title": "Talking-Heads Attention",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "SDAE",
  "full_name": "Stacked Denoising Autoencoder",
  "description": "The Stacked Denoising Autoencoder (SdA) is an extension of the stacked autoencoder [Bengio07] and it was introduced in [Vincent08].\r\n\r\nDenoising autoencoders can be stacked to form a deep network by feeding the latent representation (output code) of the [denoising autoencoder](https://paperswithcode.com/method/denoising-autoencoder) found on the layer below as input to the current layer. The unsupervised pre-training of such an architecture is done one layer at a time. Each layer is trained as a denoising autoencoder by minimizing the error in reconstructing its input (which is the output code of the previous layer). Once the first k layers are trained, we can train the k+1-th layer because we can now compute the code or latent representation from the layer below.\r\n\r\nOnce all layers are pre-trained, the network goes through a second stage of training called fine-tuning. Here we consider supervised fine-tuning where we want to minimize prediction error on a supervised task. For this, we first add a [logistic regression](https://paperswithcode.com/method/logistic-regression) layer on top of the network (more precisely on the output code of the output layer). We then train the entire network as we would train a multilayer perceptron. At this point, we only consider the encoding parts of each auto-encoder. This stage is supervised, since now we use the target class during training. (See the Multilayer Perceptron for details on the multilayer perceptron.)\r\n\r\nThis can be easily implemented in Theano, using the class defined previously for a denoising autoencoder. We can see the stacked denoising autoencoder as having two facades: a list of autoencoders, and an MLP. During pre-training we use the first facade, i.e., we treat our model as a list of autoencoders, and train each autoencoder seperately. In the second stage of training, we use the second facade. 
These two facades are linked because:\r\n* the autoencoders and the sigmoid layers of the MLP share parameters, and\r\n* the latent representations computed by intermediate layers of the MLP are fed as input to the autoencoders.\r\n\r\nExtracted from [webpage](http://deeplearning.net/tutorial/SdA.html)\r\n\r\nImage: [Jigar Bandaria](https://miro.medium.com/max/701/1*wbaL5CvUkVkZxlSUsRS5IQ.png)\r\n\r\n**Source**:\r\n\r\nImage: [Jigar Bandaria](https://blog.insightdatascience.com/brain-mri-image-segmentation-using-stacked-denoising-autoencoders-4e91417688f6)\r\n\r\nWebpage: [deeplearning.net](http://deeplearning.net/tutorial/SdA.html)\r\n\r\nWebpage: [www.iro.umontreal.ca](http://www.iro.umontreal.ca/~pift6266/H10/notes/SdA.html)\r\n\r\nPaper:\r\n\r\n[Vincent, H. Larochelle Y. Bengio and P.A. Manzagol, Extracting and Composing Robust Features with Denoising Autoencoders](https://doi.org/10.1145/1390156.1390294)\r\n\r\n[Vincent, H. Larochelle Y. Bengio and P.A. Manzagol, Extracting and Composing Robust Features with Denoising Autoencoders](http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/217)",
  "title": null,
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Graph Contrastive Coding",
  "full_name": "Graph Contrastive Coding",
  "description": "**Graph Contrastive Coding** is a self-supervised graph neural network pre-training framework to capture the universal network topological properties across multiple networks. GCC's pre-training task is designed as subgraph instance discrimination in and across networks and leverages contrastive learning to empower graph neural networks to learn the intrinsic and transferable structural representations.",
  "title": "GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Discrete Cosine Transform",
  "full_name": "Discrete Cosine Transform",
  "description": "**Discrete Cosine Transform (DCT)** is an orthogonal transformation method that decomposes an\r\nimage to its spatial frequency spectrum. It expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. It is used a lot in compression tasks, e..g image compression where for example high-frequency components can be discarded. It is a type of Fourier-related Transform, similar to discrete fourier transforms (DFTs), but only using real numbers.\r\n\r\nImage Credit: [Wikipedia](https://en.wikipedia.org/wiki/Discrete_cosine_transform#/media/File:Example_dft_dct.svg)",
  "title": null,
  "collection": "Fourier-related Transforms",
  "area": "General"
}
{
  "name": "L-GCN",
  "full_name": "Learnable adjacency matrix GCN",
  "description": "Graph structure is learnable",
  "title": "The World as a Graph: Improving El Niño Forecasts with Graph Neural Networks",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "DiCE Unit",
  "full_name": "DiCE Unit",
  "description": "A **DiCE Unit** is an image model block that is built using dimension-wise convolutions and dimension-wise fusion.  The dimension-wise convolutions apply light-weight convolutional filtering across each dimension of the input tensor while dimension-wise fusion efficiently combines these dimension-wise representations; allowing the DiCE unit to efficiently encode spatial and channel-wise information contained in the input tensor. \r\n\r\nStandard convolutions encode spatial and channel-wise information simultaneously, but they are computationally expensive. To improve the efficiency of standard convolutions, separable [convolution](https://paperswithcode.com/method/convolution) are introduced, where spatial and channelwise information are encoded separately using depth-wise and point-wise convolutions, respectively. Though this factorization is effective, it puts a significant computational load on point-wise convolutions and makes them a computational bottleneck.\r\n\r\nDiCE Units utilize a dimension-wise convolution to encode depth-wise, width-wise, and height-wise information independently. The dimension-wise convolutions encode local information from different dimensions of the input tensor, but do not capture global information. One approach is a [pointwise convolution](https://paperswithcode.com/method/pointwise-convolution), but it is computationally expensive, so instead dimension-wise fusion factorizes the point-wise convolution in two steps: (1) local fusion and (2) global fusion.",
  "title": "DiCENet: Dimension-wise Convolutions for Efficient Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "imGHUM",
  "full_name": "imGHUM",
  "description": "**imGHUM** is a generative model of 3D human shape and articulated pose, represented as a signed distance function. The full body is modeled implicitly as a function zero-level-set and without the use of an explicit template mesh. We compute the signed distance $s = S\\left(\\rho, \\alpha\\right)$ and the semantics $c = C\\left(\\rho, \\alpha\\right)$ of a spatial point $\\rho$ to the surface of an articulated human shape defined by the generative latent code $\\alpha$. Using an explicit skeleton, we transform the point $\\rho$ into the normalized coordinate frames as {$\\tilde{\\rho}^{j}$} for $N = 4$ sub-part networks, modeling body, hands, and head. Each sub-model {$S^{j}$} represents a semantic signed-distance function. The sub-models are finally combined consistently using an MLP U to compute the outputs $s$ and $c$ for the full body. The multi-part pipeline builds a full body model as well as sub-part models for head and hands, jointly, in a consistent training loop. \r\n\r\nOn the right of the Figure, we visualize the zero-level-set body surface extracted with marching cubes and the implicit correspondences to a canonical instance given by the output semantics. The semantics allows e.g. for surface coloring or texturing.",
  "title": "imGHUM: Implicit Generative Models of 3D Human Shape and Articulated Pose",
  "collection": "3D Representations",
  "area": "Computer Vision"
}
{
  "name": "Concrete Dropout",
  "full_name": "Concrete Dropout",
  "description": "Please enter a description about the method here",
  "title": "Concrete Dropout",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "CodeSLAM",
  "full_name": "CodeSLAM",
  "description": "CodeSLAM represents the 3D geometry of a scene using the latent space of a variational autoencoder. The depth thus becomes a function of the RGB image and the unknown code, $D = G_\\theta(I,c)$. During training time, the weights of the network $G_\\theta$ are learnt by training the generator and encoder using a standard autoencoding task. At test time the code $c$ and the pose of the images is found by optimizing the reprojection error over multiple images.",
  "title": "CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM",
  "collection": "3D Reconstruction",
  "area": "Computer Vision"
}
{
  "name": "U-RNNs",
  "full_name": "Asymmetrical Bi-RNN",
  "description": "An aspect of Bi-RNNs that could be undesirable is the architecture's symmetry in both time directions.\r\n\r\n Bi-RNNs are often used in natural language processing, where the order of the words is almost exclusively determined by grammatical rules and not by temporal sequentiality.  However, in some cases, the data has a preferred direction in time: the forward direction. \r\n\r\nAnother potential drawback of Bi-RNNs is that their output is simply the concatenation of two naive readings of the input in both directions. In consequence, Bi-RNNs never actually read an input by knowing what happens in the future. Conversely, the idea behind U-RNN, is to first do a backward pass, and then use during the forward pass information about the future.\r\n\r\nWe accumulate information while knowing which part of the information will be useful in the future as it should be relevant to do so if the forward direction is the preferred direction of the data.\r\n\r\nThe backward and forward hidden states $(h^b_t)$ and  $(h^f_t)$ are obtained according to these equations:\r\n\r\n\\begin{equation}\r\n\\begin{aligned}\r\n&h_{t-1}^{b}=R N N\\left(h_{t}^{b}, e_{t}, W_{b}\\right) \\\\\r\n&h_{t+1}^{f}=R N N\\left(h_{t}^{f},\\left[e_{t}, h_{t}^{b}\\right], W_{f}\\right)\r\n\\end{aligned}\r\n\\end{equation}\r\n\r\nwhere $W_b$ and $W_f$ are learnable weights that are shared among pedestrians, and $[\\cdot, \\cdot]$ denotes concatenation. The last hidden state $h^f_{T_{obs}}$ is then used as the encoding of the sequence.",
  "title": "Asymmetrical Bi-RNN for pedestrian trajectory encoding",
  "collection": "Bidirectional Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Single-path NAS",
  "full_name": "Single-path NAS",
  "description": "**Single-Path NAS** is a convolutional neural network architecture discovered through the Single-Path [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) approach. The NAS utilises a single-path search space. Specifically, compared to previous differentiable  NAS methods, Single-Path NAS uses one single-path over-parameterized  ConvNet to encode all architectural decisions with shared convolutional kernel parameters. The approach is built upon the  observation that different candidate convolutional operations in NAS  can be viewed as subsets of a single superkernel. Without having to  choose among different paths/operations as in multi-path methods, we instead  solve the NAS problem as finding which subset of kernel weights to use in each ConvNet layer. By sharing the convolutional kernel weights,  we encode all candidate NAS operations into a single superkernel.\r\n\r\nThe architecture itself uses the [inverted residual block](https://paperswithcode.com/method/inverted-residual-block) from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2) as its basic building block.",
  "title": "Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "XLM-R",
  "full_name": "XLM-R",
  "description": "XLM-R",
  "title": "Unsupervised Cross-lingual Representation Learning at Scale",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "GreedyNAS-A",
  "full_name": "GreedyNAS-A",
  "description": "**GreedyNAS-A** is a convolutional neural network discovered using the [GreedyNAS](https://paperswithcode.com/method/greedynas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building blocks used are inverted residual blocks (from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2)) and squeeze-and-excitation blocks.",
  "title": "GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "CascadePSP",
  "full_name": "CascadePSP",
  "description": "**CascadePSP** is a general segmentation refinement model that refines any given segmentation from low to high resolution. The model takes as input an initial mask that can be an output of any algorithm to provide a rough object location. Then the CascadePSP will output a refined mask. The model is designed in a cascade fashion that generates refined segmentation in a coarse-to-fine manner. Coarse outputs from the early levels predict object structure which will be used as input to the latter levels to refine boundary details.",
  "title": "CascadePSP: Toward Class-Agnostic and Very High-Resolution Segmentation via Global and Local Refinement",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "MEI",
  "full_name": "Multi-partition Embedding Interaction",
  "description": "**MEI** introduces the *multi-partition embedding interaction* technique with block term tensor format to systematically address the efficiency--expressiveness trade-off in knowledge graph embedding. It divides the embedding vector into multiple partitions and learns the local interaction patterns from data instead of using fixed special patterns as in ComplEx or SimplE models. This enables MEI to achieve optimal efficiency--expressiveness trade-off, not just being fully expressive. Previous methods such as TuckER, RESCAL, DistMult, ComplEx, and SimplE are suboptimal restricted special cases of MEI.",
  "title": "Multi-Partition Embedding Interaction with Block Term Format for Knowledge Graph Completion",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "RMN",
  "full_name": "Residual Masking Network",
  "description": "It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions.",
  "title": "Facial Expression Recognition using Residual Masking Network",
  "collection": "Attention",
  "area": "General"
}
{
  "name": "Involution",
  "full_name": "Involution",
  "description": "**Involution** is an atomic operation for deep neural networks that inverts the design principles of convolution. Involution kernels are distinct in the spatial extent but shared across channels. If involution kernels are parameterized as fixed-sized matrices like convolution kernels and updated using the back-propagation algorithm, the learned involution kernels are impeded from transferring between input images with variable resolutions. \r\n\r\nThe authors argue for two benefits of involution over convolution: (i) involution can summarize the context in a wider spatial arrangement, thus overcome the difficulty of modeling long-range interactions well; (ii) involution can adaptively allocate the weights over different positions, so as to prioritize the most informative visual elements in the spatial domain.",
  "title": "Involution: Inverting the Inherence of Convolution for Visual Recognition",
  "collection": "Image Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "PNA",
  "full_name": "Principal Neighbourhood Aggregation",
  "description": "**Principal Neighbourhood Aggregation** (PNA) is a general and flexible architecture for graphs combining multiple aggregators with degree-scalers (which generalize the sum aggregator).",
  "title": "Principal Neighbourhood Aggregation for Graph Nets",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "SaBN",
  "full_name": "Sandwich Batch Normalization",
  "description": "Sandwich Batch Normalization (**SaBN**) is a frustratingly easy improvement of [Batch Normalization](https://paperswithcode.com/method/batch-normalization) (BN) with only a few lines of code changes. SaBN is motivated by addressing the inherent *feature distribution heterogeneity* that one can be identified in many tasks, which can arise from data heterogeneity (multiple input domains) or model heterogeneity (dynamic architectures, model conditioning, etc.). Our SaBN factorizes the BN affine layer into one shared *sandwich affine* layer, cascaded by several parallel *independent affine* layers. We demonstrate the prevailing effectiveness of SaBN as a **drop-in replacement in four tasks**: *conditional image generation*, *[neural architecture search](https://paperswithcode.com/method/neural-architecture-search)* (NAS), *adversarial training*, and *arbitrary style transfer*. Leveraging SaBN immediately achieves better Inception Score and FID on CIFAR-10 and ImageNet conditional image generation with three state-of-the-art GANs; boosts the performance of a state-of-the-art weight-sharing NAS algorithm significantly on NAS-Bench-201; substantially improves the robust and standard accuracies for adversarial defense; and produces superior arbitrary stylized results.",
  "title": "Sandwich Batch Normalization: A Drop-In Replacement for Feature Distribution Heterogeneity",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Early exiting",
  "full_name": "Early exiting using confidence measures",
  "description": "Exit whenever the model is confident enough allowing early exiting from hidden layers",
  "title": "BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "DCLS",
  "full_name": "Dilated convolution with learnable spacings",
  "description": "Dilated convolution with learnable spacings (DCLS) is a type of convolution that allows the spacings between the non-zero elements of the kernel to be learned during training. This makes it possible to increase the receptive field of the convolution without increasing the number of parameters, which can improve the performance of the network on tasks that require long-range dependencies.\r\n\r\nA dilated convolution is a type of convolution that allows the kernel to be skipped over some of the input features. This is done by inserting zeros between the non-zero elements of the kernel. The effect of this is to increase the receptive field of the convolution without increasing the number of parameters.\r\n\r\nDCLS takes this idea one step further by allowing the spacings between the non-zero elements of the kernel to be learned during training. This means that the network can learn to skip over different input features depending on the task at hand. This can be particularly helpful for tasks that require long-range dependencies, such as image segmentation and object detection.\r\n\r\nDCLS has been shown to be effective for a variety of tasks, including image classification, object detection, and semantic segmentation. It is a promising new technique that has the potential to improve the performance of convolutional neural networks on a variety of tasks.",
  "title": "Dilated convolution with learnable spacings",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Nipuna",
  "full_name": "Optimizer Activation Function",
  "description": "A new activation function named NIPUNA : f(x)=max⁡〖(g(x),x)〗 where g(x)=x/(〖(1+e〗^(-βx)))",
  "title": "0/1 Deep Neural Networks via Block Coordinate Descent",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "LV-ViT",
  "full_name": "LV-ViT",
  "description": "**LV-ViT** is a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) that uses token labelling as a training objective. Different from the standard training\r\nobjective of ViTs that computes the classification loss on an additional trainable class token, token labelling takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator.",
  "title": "All Tokens Matter: Token Labeling for Training Better Vision Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "SCARLET",
  "full_name": "SCARLET",
  "description": "**SCARLET** is a type of convolutional neural architecture learnt by the [SCARLET-NAS](https://paperswithcode.com/method/scarlet-nas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The three variants are SCARLET-A, SCARLET-B and SCARLET-C. The basic building block is MBConvs from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2). Squeeze-and-excitation layers are also experimented with.",
  "title": "SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "MotionNet",
  "full_name": "MotionNet",
  "description": "**MotionNet** is a system for joint perception and motion prediction based on a bird's eye view (BEV) map, which encodes the object category and motion information from 3D point clouds in each grid cell. MotionNet takes a sequence of LiDAR sweeps as input and outputs the bird's eye view (BEV) map. The backbone of MotionNet is a spatio-temporal pyramid network, which extracts deep spatial and temporal features in a hierarchical fashion. To enforce the smoothness of predictions over both space and time, the training of MotionNet is further regularized with novel spatial and temporal consistency losses.",
  "title": "MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird's Eye View Maps",
  "collection": "Motion Prediction Models",
  "area": "Computer Vision"
}
{
  "name": "HRNet",
  "full_name": "HRNet",
  "description": "**HRNet**, or **High-Resolution Net**, is a general purpose convolutional neural network for tasks like semantic segmentation, object detection and image classification. It is able to maintain high resolution representations through the whole process. We start from a high-resolution [convolution](https://paperswithcode.com/method/convolution) stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several ($4$ in the paper) stages and\r\nthe $n$th stage contains $n$ streams corresponding to $n$ resolutions. The authors conduct repeated multi-resolution fusions by exchanging the information across the parallel streams over and over.",
  "title": "Deep High-Resolution Representation Learning for Visual Recognition",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "MADGRAD",
  "full_name": "Momentumized, adaptive, dual averaged gradient",
  "description": "The MADGRAD method contains a series of modifications to the [AdaGrad](https://paperswithcode.com/method/adagrad)-DA method to improve its performance on deep learning optimization problems. It gives state-of-the-art generalization performance across a diverse set of problems, including those that [Adam](https://paperswithcode.com/method/adam) normally under-performs on.",
  "title": "Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Longformer",
  "full_name": "Longformer",
  "description": "**Longformer** is a modified [Transformer](https://paperswithcode.com/method/transformer) architecture. Traditional [Transformer-based models](https://paperswithcode.com/methods/category/transformers) are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this, **Longformer** uses an attention pattern that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. The attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention.\r\n\r\nThe attention patterns utilised include: [sliding window attention](https://paperswithcode.com/method/sliding-window-attention), [dilated sliding window attention](https://paperswithcode.com/method/dilated-sliding-window-attention) and global + sliding window. These can be viewed in the components section of this page.",
  "title": "Longformer: The Long-Document Transformer",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Packed Levitated Markers",
  "full_name": "Packed Levitated Markers",
  "description": "**Packed Levitated Markers**, or **PL-Marker**, is a span representation approach for [named entity recognition](https://paperswithcode.com/task/named-entity-recognition-ner) that considers the dependencies between spans (pairs) by strategically packing the markers in the encoder. A pair of Levitated Markers, emphasizing a span, consists of a start marker and an end marker which share the same position embeddings with span’s start and end tokens respectively. In addition, both levitated markers adopt a restricted attention, that is, they are visible to each other, but not to the text token and other pairs of markers. sBased on the above features, the levitated marker would not affect the attended context of the original text tokens, which allows us to flexibly pack a series of related spans with their levitated markers in the encoding phase and thus model their dependencies.",
  "title": "Packed Levitated Marker for Entity and Relation Extraction",
  "collection": "Span Representations",
  "area": "Natural Language Processing"
}
{
  "name": "SimCSE",
  "full_name": "SimCSE",
  "description": "**SimCSE** is a contrastive learning framework for generating sentence embeddings. It utilizes an unsupervised approach, which takes an input sentence and predicts itself in contrastive objective, with only standard [dropout](https://paperswithcode.com/method/dropout) used as noise. The authors find that dropout acts as minimal “data augmentation” of hidden representations, while removing it leads to a representation collapse. Afterwards a supervised approach is used, which incorporates annotated pairs from natural language inference datasets into the contrastive framework, by using “entailment” pairs as positives and “contradiction",
  "title": "SimCSE: Simple Contrastive Learning of Sentence Embeddings",
  "collection": "Sentence Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "CTAB-GAN",
  "full_name": "CTAB-GAN",
  "description": "**CTAB-GAN** is a model for conditional tabular data generation. The generator and discriminator utilize the [DCGAN](https://paperswithcode.com/method/dcgan) architecture. An [auxiliary classifier](https://paperswithcode.com/method/auxiliary-classifier) is also used with an MLP architecture.",
  "title": "Effective and Privacy preserving Tabular Data Synthesizing",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "Selective Kernel",
  "full_name": "Selective Kernel",
  "description": "A **Selective Kernel** unit is a bottleneck block consisting of a sequence of 1×1 [convolution](https://paperswithcode.com/method/convolution), SK convolution and 1×1 convolution. It was proposed as part of the [SKNet](https://paperswithcode.com/method/sknet) CNN architecture. In general, all the large kernel convolutions in the original bottleneck blocks in [ResNeXt](https://paperswithcode.com/method/resnext) are replaced by the proposed SK convolutions, enabling the network to choose appropriate receptive field sizes in an adaptive manner. \r\n\r\nIn SK units, there are three important hyper-parameters which determine the final settings of SK convolutions: the number of paths $M$ that determines the number of choices of different kernels to be aggregated, the group number $G$ that controls the cardinality of each path, and the reduction ratio $r$ that controls the number of parameters in the fuse operator. One typical setting of SK convolutions is $\\text{SK}\\left[M, G, r\\right]$ to be $\\text{SK}\\left[2, 32, 16\\right]$.",
  "title": "Selective Kernel Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "RPDet",
  "full_name": "RPDet",
  "description": "**RPDet**, or **RepPoints Detector**, is a anchor-free, two-stage object detection model based on deformable convolutions.  [RepPoints](https://paperswithcode.com/method/reppoints) serve as the basic object representation throughout the detection system. Starting from the center points, the first set of RepPoints is obtained via regressing offsets over the center points. The learning of these RepPoints is driven by two objectives: 1) the top-left and bottom-right points distance loss between the induced pseudo box and the ground-truth bounding box; 2) the object recognition loss of the subsequent stage.",
  "title": "RepPoints: Point Set Representation for Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "LDA",
  "full_name": "Linear Discriminant Analysis",
  "description": "**Linear discriminant analysis** (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.\r\n\r\nExtracted from [Wikipedia](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)\r\n\r\n**Source**:\r\n\r\nPaper: [Linear Discriminant Analysis: A Detailed Tutorial](https://dx.doi.org/10.3233/AIC-170729)\r\n\r\nPublic version: [Linear Discriminant Analysis: A Detailed Tutorial](https://usir.salford.ac.uk/id/eprint/52074/)",
  "title": null,
  "collection": "Dimensionality Reduction",
  "area": "General"
}
{
  "name": "SimVLM",
  "full_name": "Simple Visual Language Model",
  "description": "SimVLM is a minimalist pretraining framework to reduce training complexity by exploiting large-scale weak supervision. It is trained end-to-end with a single prefix language modeling (PrefixLM) objective. PrefixLM enables bidirectional attention within the prefix sequence, and thus it is applicable for both decoder-only\r\nand encoder-decoder sequence-to-sequence language models.",
  "title": "SimVLM: Simple Visual Language Model Pretraining with Weak Supervision",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "CutBlur",
  "full_name": "CutBlur",
  "description": "**CutBlur** is a data augmentation method that is specifically designed for the low-level vision tasks. It cuts a low-resolution patch and pastes it to the corresponding high-resolution image region and vice versa. The key intuition of Cutblur is to enable a model to learn not only \"how\" but also \"where\" to super-resolve an image. By doing so, the model can understand \"how much\" instead of blindly learning to apply super-resolution to every given pixel.",
  "title": "Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Analysis and a New Strategy",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "DE-GAN",
  "full_name": "DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement",
  "description": "Documents often exhibit various forms of degradation, which make it hard to be read and substantially deteriorate the\r\nperformance of an OCR system. In this paper, we propose an effective end-to-end framework named Document Enhancement\r\nGenerative Adversarial Networks (DE-GAN) that uses the conditional GANs (cGANs) to restore severely degraded document images.\r\nTo the best of our knowledge, this practice has not been studied within the context of generative adversarial deep networks. We\r\ndemonstrate that, in different tasks (document clean up, binarization, deblurring and watermark removal), DE-GAN can produce an\r\nenhanced version of the degraded document with a high quality. In addition, our approach provides consistent improvements compared to state-of-the-art methods over the widely used DIBCO 2013, DIBCO 2017 and H-DIBCO 2018 datasets, proving its ability to restore a degraded document image to its ideal condition. The obtained results on a wide variety of degradation reveal the flexibility of the proposed model to be exploited in other document enhancement problems.",
  "title": null,
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "WaveVAE",
  "full_name": "WaveVAE",
  "description": "**WaveVAE** is a generative audio model that can be used as a vocoder in text-to-speech systems. It is a [VAE](https://paperswithcode.com/method/vae) based model that can be trained from scratch by jointly optimizing the encoder $q\\_{\\phi}\\left(\\mathbf{z}|\\mathbf{x}, \\mathbf{c}\\right)$ and decoder $p\\_{\\theta}\\left(\\mathbf{x}|\\mathbf{z}, \\mathbf{c}\\right)$, where $\\mathbf{z}$ is latent variables and $\\mathbf{c}$ is the mel spectrogram conditioner. \r\n\r\nThe encoder of WaveVAE $q\\_{\\phi}\\left(\\mathbf{z}|\\mathbf{x}\\right)$ is parameterized by a Gaussian autoregressive [WaveNet](https://paperswithcode.com/method/wavenet) that maps the ground truth audio x into the same length latent representation $\\mathbf{z}$. The decoder $p\\_{\\theta}\\left(\\mathbf{x}|\\mathbf{z}\\right)$ is parameterized by the one-step ahead predictions from an inverse autoregressive flow.\r\n\r\nThe training objective is the ELBO for the observed $\\mathbf{x}$ in the VAE.",
  "title": "Non-Autoregressive Neural Text-to-Speech",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "Fawkes",
  "full_name": "Fawkes",
  "description": "**Fawkes** is an image cloaking system that helps individuals inoculate their images against unauthorized facial recognition models. Fawkes achieves this by helping users add imperceptible pixel-level changes (\"cloaks\") to their own photos before releasing them. When used to train facial recognition models, these \"cloaked\" images produce functional models that consistently cause normal images of the user to be misidentified.",
  "title": "Fawkes: Protecting Privacy against Unauthorized Deep Learning Models",
  "collection": "Face Privacy",
  "area": "Computer Vision"
}
{
  "name": "VQ-VAE",
  "full_name": "VQ-VAE",
  "description": "**VQ-VAE** is a type of variational autoencoder that uses vector quantisation to obtain a discrete latent representation. It differs from [VAEs](https://paperswithcode.com/method/vae) in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, ideas from vector quantisation (VQ) are incorporated. Using the VQ method allows the model to circumvent issues of posterior collapse - where the latents are ignored when they are paired with a powerful autoregressive decoder - typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes.",
  "title": "Neural Discrete Representation Learning",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "UNITER",
  "full_name": "UNiversal Image-TExt Representation Learning",
  "description": "UNITER or UNiversal Image-TExt Representation model is a large-scale pre-trained model for joint multimodal embedding. It is pre-trained using four image-text datasets COCO, Visual Genome, Conceptual Captions, and SBU Captions. It can power heterogeneous downstream V+L tasks with joint multimodal embeddings. \r\nUNITER takes the visual regions of the image and textual tokens of the sentence as inputs. A faster R-CNN is used in Image Embedder to extract the visual features of each region and a Text Embedder is used to tokenize the input sentence into WordPieces.  \r\n\r\nIt proposes WRA via the Optimal Transport to provide more fine-grained alignment between word tokens and image regions that is effective in calculating the minimum cost of transporting the contextualized image embeddings to word embeddings and vice versa. \r\n\r\nFour pretraining tasks were designed for this model. They are Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). This model is different from the previous models because it uses conditional masking on pre-training tasks.",
  "title": "UNITER: UNiversal Image-TExt Representation Learning",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "Knowledge Distillation",
  "full_name": "Knowledge Distillation",
  "description": "A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\r\nSource: [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)",
  "title": "Distilling the Knowledge in a Neural Network",
  "collection": "Knowledge Distillation",
  "area": "General"
}
{
  "name": "bilayer decoupling",
  "full_name": "bilayer convolutional neural network",
  "description": "",
  "title": "Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers",
  "collection": "Instance Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "UCNet",
  "full_name": "UCNet",
  "description": "**UCNet** is a probabilistic framework for RGB-D Saliency Detection that employs uncertainty by learning from the data labelling process. It utilizes conditional variational autoencoders to model human annotation uncertainty and generate multiple saliency maps for each input image by sampling in the latent space.",
  "title": "UC-Net: Uncertainty Inspired RGB-D Saliency Detection via Conditional Variational Autoencoders",
  "collection": "RGB-D Saliency Detection Models",
  "area": "Computer Vision"
}
{
  "name": "OPT-IML",
  "full_name": "OPT-IML",
  "description": "**OPT-IML** is a version of OPT fine-tuned on a large collection of 1500+ NLP tasks divided into various task categories.",
  "title": "OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "R2D2",
  "full_name": "Recurrent Replay Distributed DQN",
  "description": "Building on the recent successes of distributed training of RL agents, R2D2 is an RL approach that trains a RNN-based RL agents from distributed prioritized experience replay. \r\nUsing a single network architecture and fixed set of hyperparameters, Recurrent Replay Distributed DQN quadrupled the previous state of the art on Atari-57, and matches the state of the art on DMLab-30. \r\nIt was the first agent to exceed human-level performance in 52 of the 57 Atari games.",
  "title": "Recurrent Experience Replay in Distributed Reinforcement Learning",
  "collection": "Offline Reinforcement Learning Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "Categorical Modularity",
  "full_name": "Categorical Modularity",
  "description": "A novel low-resource intrinsic metric to evaluate word\r\nembedding quality based on graph modularity.",
  "title": "Evaluating Word Embeddings with Categorical Modularity",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "SIRM",
  "full_name": "Skim and Intensive Reading Model",
  "description": "**Skim and Intensive Reading Model**, or **SIRM**, is a deep neural network for figuring out implied textual meaning. It consists of two main components, namely the skim reading component and intensive reading component. N-gram features are quickly extracted from the skim reading component, which is a combination of several convolutional neural networks, as skim (entire) information. An intensive reading component enables a hierarchical investigation for both local (sentence) and global (paragraph) representation, which encapsulates the current embedding and the contextual information with a dense connection.",
  "title": "Read Beyond the Lines: Understanding the Implied Textual Meaning via a Skim and Intensive Reading Model",
  "collection": "Textual Meaning",
  "area": "Natural Language Processing"
}
{
  "name": "Capsule Network",
  "full_name": "Capsule Network",
  "description": "A capsule is an activation vector that basically executes on its inputs some complex internal\r\ncomputations. Length of these activation vectors signifies the\r\nprobability of availability of a feature. Furthermore, the condition\r\nof the recognized element is encoded as the direction in which\r\nthe vector is pointing. In traditional, CNN uses Max pooling for\r\ninvariance activities of neurons, which is nothing except a minor\r\nchange in input and the neurons of output signal will remains\r\nsame.",
  "title": "Dynamic Routing Between Capsules",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "Leaky ReLU",
  "full_name": "Leaky ReLU",
  "description": "**Leaky Rectified Linear Unit**, or **Leaky ReLU**, is a type of activation function based on a [ReLU](https://paperswithcode.com/method/relu), but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training. This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "PPO",
  "full_name": "Proximal Policy Optimization",
  "description": "**Proximal Policy Optimization**, or **PPO**, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of [TRPO](https://paperswithcode.com/method/trpo), while using only first-order optimization. \r\n\r\nLet $r\\_{t}\\left(\\theta\\right)$ denote the probability ratio $r\\_{t}\\left(\\theta\\right) = \\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}$, so $r\\left(\\theta\\_{old}\\right) = 1$. TRPO maximizes a “surrogate” objective:\r\n\r\n$$ L^{\\text{CPI}}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)})\\hat{A}\\_{t}\\right] = \\hat{\\mathbb{E}}\\_{t}\\left[r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}\\right] $$\r\n\r\nWhere $CPI$ refers to a conservative policy iteration. Without a constraint, maximization of $L^{CPI}$ would lead to an excessively large policy update; hence, we PPO modifies the objective, to penalize changes to the policy that move $r\\_{t}\\left(\\theta\\right)$ away from 1:\r\n\r\n$$ J^{\\text{CLIP}}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\min\\left(r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}, \\text{clip}\\left(r\\_{t}\\left(\\theta\\right), 1-\\epsilon, 1+\\epsilon\\right)\\hat{A}\\_{t}\\right)\\right] $$\r\n\r\nwhere $\\epsilon$ is a hyperparameter, say, $\\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{CPI}$. The second term, $\\text{clip}\\left(r\\_{t}\\left(\\theta\\right), 1-\\epsilon, 1+\\epsilon\\right)\\hat{A}\\_{t}$ modifies the surrogate\r\nobjective by clipping the probability ratio, which removes the incentive for moving $r\\_{t}$ outside of the interval $\\left[1 − \\epsilon, 1 + \\epsilon\\right]$. 
Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse. \r\n\r\nOne detail to note is that when we apply PPO for a network where we have shared parameters for actor and critic functions, we typically add to the objective function an error term on value estimation and an entropy term to encourage exploration.",
  "title": "Proximal Policy Optimization Algorithms",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "ProxylessNet-CPU",
  "full_name": "ProxylessNet-CPU",
  "description": "**ProxylessNet-CPU** is an image model learnt with the [ProxylessNAS](https://paperswithcode.com/method/proxylessnas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) algorithm that is optimized for CPU devices. It uses inverted residual blocks (MBConvs) from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2) as its basic building block.",
  "title": "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "Parallel Layers",
  "full_name": "Parallel Layers",
  "description": "• Parallel Layers – We use a “parallel” formulation in each Transformer block (Wang & Komatsuzaki, 2021), rather than the standard “serialized” formulation. Specifically, the standard formulation can be written as:   \r\n    y = x + MLP(LayerNorm(x + Attention(LayerNorm(x)))   \r\n\r\nWhereas the parallel formulation can be written as:   \r\n    y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))   \r\n\r\nThe parallel formulation results in roughly 15% faster training speed at large scales, since the MLP and Attention input matrix multiplications can be fused. Ablation experiments showed a small quality degradation at 8B scale but no quality degradation at 62B scale, so we extrapolated that the effect of parallel layers should be quality neutral at the 540B scale.",
  "title": "PaLM: Scaling Language Modeling with Pathways",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "EncAttAgg",
  "full_name": "Encoder-Attender-Aggregator",
  "description": "EncAttAgg introduced two attenders to tackle two problems: 1) We introduce a mutual attender layer to efficiently obtain the entity-pair-specific mention representations. 2) We introduce an integration attender to weight mention pairs of a target entity pair.",
  "title": null,
  "collection": null,
  "area": null
}
{
  "name": "SBERT",
  "full_name": "Sentence-BERT",
  "description": "",
  "title": "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  "collection": "Sentence Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "IC-SBP",
  "full_name": "Instance Colouring Stick-Breaking Process",
  "description": "",
  "title": "GENESIS-V2: Inferring Unordered Object Representations without Iterative Refinement",
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "Soups",
  "full_name": "Model Soups",
  "description": "Compress an ensemble of models into a single one by averaging their weights (under certain pre-conditions).",
  "title": "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time",
  "collection": "Model Compression",
  "area": "General"
}
{
  "name": "DenseNet",
  "full_name": "DenseNet",
  "description": "A **DenseNet** is a type of convolutional neural network that utilises [dense connections](https://paperswithcode.com/method/dense-connections) between layers, through [Dense Blocks](http://www.paperswithcode.com/method/dense-block), where we connect *all layers* (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers.",
  "title": "Densely Connected Convolutional Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Reformer",
  "full_name": "Reformer",
  "description": "**Reformer** is a [Transformer](https://paperswithcode.com/method/transformer) based architecture that seeks to make efficiency improvements. [Dot-product attention](https://paperswithcode.com/method/dot-product-attention) is replaced by one that uses locality-sensitive hashing, changing its complexity\r\nfrom O($L^2$) to O($L\\log L$), where $L$ is the length of the sequence. Furthermore, Reformers use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers.",
  "title": "Reformer: The Efficient Transformer",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "MelGAN",
  "full_name": "MelGAN",
  "description": "**MelGAN** is a non-autoregressive feed-forward convolutional architecture to perform audio waveform generation in a [GAN](https://paperswithcode.com/method/gan) setup. The architecture is a fully convolutional feed-forward network with mel-spectrogram $s$ as input and raw waveform $x$ as output. Since the mel-spectrogram is at\r\na 256× lower temporal resolution, the authors use a stack of transposed convolutional layers to upsample the input sequence. Each transposed convolutional layer is followed by a stack of residual blocks with dilated convolutions. Unlike traditional GANs, the MelGAN generator does not use a global noise vector as input.\r\n\r\nTo deal with 'checkerboard artifacts' in audio, instead of using [PhaseShuffle](https://paperswithcode.com/method/phase-shuffle), MelGAN uses kernel-size as a multiple of stride.\r\n\r\n[Weight normalization](https://paperswithcode.com/method/weight-normalization) is used for normalization. A [window-based discriminator](https://paperswithcode.com/method/window-based-discriminator), similar to a [PatchGAN](https://paperswithcode.com/method/patchgan) is used for the discriminator.",
  "title": "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "Spatial Feature Transform",
  "full_name": "Spatial Feature Transform",
  "description": "**Spatial Feature Transform**, or **SFT**, is a layer that generates affine transformation parameters for spatial-wise feature modulation, and was originally proposed within the context of image super-resolution. A Spatial Feature Transform (SFT) layer learns a mapping function $\\mathcal{M}$ that outputs a modulation parameter pair $(\\mathbf{\\gamma}, \\mathbf{\\beta})$ based on some prior condition $\\Psi$. The learned parameter pair adaptively influences the outputs by applying an affine transformation spatially to each intermediate feature maps in an SR network. During testing, only a single forward pass is needed to generate the HR image given the LR input and segmentation probability maps.\r\n\r\nMore precisely, the prior $\\Psi$ is modeled by a pair of affine transformation parameters $(\\mathbf{\\gamma}, \\mathbf{\\beta})$ through a mapping function $\\mathcal{M}: \\Psi \\mapsto(\\mathbf{\\gamma}, \\mathbf{\\beta})$. Consequently,\r\n\r\n$$\r\n\\hat{\\mathbf{y}}=G_{\\mathbf{\\theta}}(\\mathbf{x} \\mid \\mathbf{\\gamma}, \\mathbf{\\beta}), \\quad(\\mathbf{\\gamma}, \\mathbf{\\beta})=\\mathcal{M}(\\Psi)\r\n$$\r\n\r\nAfter obtaining $(\\mathbf{\\gamma}, \\mathbf{\\beta})$ from conditions, the transformation is carried out by scaling and shifting feature maps of a specific layer:\r\n\r\n$$\r\n\\operatorname{SFT}(\\mathbf{F} \\mid \\mathbf{\\gamma}, \\mathbf{\\beta})=\\mathbf{\\gamma} \\odot \\mathbf{F}+\\mathbf{\\beta}\r\n$$\r\n\r\nwhere $\\mathbf{F}$ denotes the feature maps, whose dimension is the same as $\\gamma$ and $\\mathbf{\\beta}$, and $\\odot$ is referred to element-wise multiplication, i.e., Hadamard product. Since the spatial dimensions are preserved, the SFT layer not only performs feature-wise manipulation but also spatial-wise transformation.",
  "title": "Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "uNetXST",
  "full_name": "uNetXST",
  "description": "uNet neural network architecture which takes multiple (X) tensors as input and contains [Spatial Transformer](https://paperswithcode.com/method/spatial-transformer) units (ST)",
  "title": "A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird's Eye View",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "SVD Parameterization",
  "full_name": "Singular Value Decomposition Parameterization",
  "description": "",
  "title": "Stabilizing Gradients for Deep Neural Networks via Efficient SVD Parameterization",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Streaming Module",
  "full_name": "Streaming Module",
  "description": "",
  "title": "FeatherNets: Convolutional Neural Networks as Light as Feather for Face Anti-spoofing",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Fast-YOLOv3",
  "full_name": "Fast-YOLOv3",
  "description": "",
  "title": "YOLOv3: An Incremental Improvement",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Highway Layer",
  "full_name": "Highway Layer",
  "description": "A **Highway Layer** contains an information highway to other layers that helps with information flow. It is characterised by the use of a gating unit to help this information flow. \r\n\r\nA plain feedforward neural network typically consists of $L$ layers where the $l$th layer ($l \\in ${$1, 2, \\dots, L$}) applies a nonlinear transform $H$ (parameterized by $\\mathbf{W\\_{H,l}}$) on its input $\\mathbf{x\\_{l}}$ to produce its output $\\mathbf{y\\_{l}}$. Thus, $\\mathbf{x\\_{1}}$ is the input to the network and $\\mathbf{y\\_{L}}$ is the network’s output. Omitting the layer index and biases for clarity,\r\n\r\n$$ \\mathbf{y} = H\\left(\\mathbf{x},\\mathbf{W\\_{H}}\\right) $$\r\n\r\n$H$ is usually an affine transform followed by a non-linear activation function, but in general it may take other forms. \r\n\r\nFor a [highway network](https://paperswithcode.com/method/highway-network), we additionally define two nonlinear transforms $T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right)$ and $C\\left(\\mathbf{x},\\mathbf{W\\_{C}}\\right)$ such that:\r\n\r\n$$ \\mathbf{y} = H\\left(\\mathbf{x},\\mathbf{W\\_{H}}\\right)·T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right) + \\mathbf{x}·C\\left(\\mathbf{x},\\mathbf{W\\_{C}}\\right)$$\r\n\r\nWe refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. In the original paper, the authors set $C = 1 − T$, giving:\r\n\r\n$$ \\mathbf{y} = H\\left(\\mathbf{x},\\mathbf{W\\_{H}}\\right)·T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right) + \\mathbf{x}·\\left(1-T\\left(\\mathbf{x},\\mathbf{W\\_{T}}\\right)\\right)$$\r\n\r\nThe authors set:\r\n\r\n$$ T\\left(x\\right) = \\sigma\\left(\\mathbf{W\\_{T}}^{T}\\mathbf{x} + \\mathbf{b\\_{T}}\\right) $$\r\n\r\nImage: [Sik-Ho Tsang](https://towardsdatascience.com/review-highway-networks-gating-function-to-highway-image-classification-5a33833797b5)",
  "title": "Highway Networks",
  "collection": "Miscellaneous Components",
  "area": "General"
}
{
  "name": "KungFu",
  "full_name": "KungFu",
  "description": "**KungFu** is a distributed ML library for TensorFlow that is designed to enable adaptive training. KungFu allows users to express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the dataflow graph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency\r\nof monitoring and adaptation operations.",
  "title": null,
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "GNS",
  "full_name": "Graph Network-based Simulators",
  "description": "**Graph Network-Based Simulators** is a type of graph neural network that represents the state of a physical system with particles, expressed as nodes in a graph, and computes dynamics via learned message-passing.",
  "title": "Learning to Simulate Complex Physics with Graph Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Single-Headed Attention",
  "full_name": "Single-Headed Attention",
  "description": "**Single-Headed Attention** is a single-headed attention module used in the [SHA-RNN](https://paperswithcode.com/method/sha-rnn) language model. The principle design reasons for single-headedness were simplicity (avoiding running out of memory) and scepticism about the benefits of using multiple heads.",
  "title": "Single Headed Attention RNN: Stop Thinking With Your Head",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "FLICA",
  "full_name": "A Framework for Leader Identification in Coordinated Activity",
  "description": "An agreement of a group to follow a common purpose is manifested by its coalescence into a coordinated behavior. The process of initiating this behavior and the period of decision-making by the group members necessarily precedes the coordinated behavior. Given time series of group members’ behavior, the goal is to find these periods of decision-making and identify the initiating individual, if one exists.\r\n\r\nImage Source:  [Amornbunchornvej et al.](https://arxiv.org/pdf/1603.01570v2.pdf)",
  "title": "Coordination Event Detection and Initiator Identification in Time Series Data",
  "collection": "Time Series Analysis",
  "area": "Sequential"
}
{
  "name": "PonderNet",
  "full_name": "PonderNet",
  "description": "**PonderNet** is an adaptive computation method that learns to adapt the amount of computation based on the complexity of the problem at hand. PonderNet learns end-to-end the number of computational steps to achieve an effective compromise between training prediction accuracy, computational cost and generalization.",
  "title": "PonderNet: Learning to Ponder",
  "collection": "Adaptive Computation",
  "area": "General"
}
{
  "name": "AlexNet",
  "full_name": "AlexNet",
  "description": "**AlexNet** is a classic convolutional neural network architecture. It consists of convolutions, [max pooling](https://paperswithcode.com/method/max-pooling) and dense layers as the basic building blocks. Grouped convolutions are used in order to fit the model across two GPUs.",
  "title": "ImageNet Classification with Deep Convolutional Neural Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Contrastive Learning",
  "full_name": null,
  "description": "",
  "title": null,
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "POMO",
  "full_name": "POMO",
  "description": "",
  "title": "POMO: Policy Optimization with Multiple Optima for Reinforcement Learning",
  "collection": "Reinforcement Learning Frameworks",
  "area": "Reinforcement Learning"
}
{
  "name": "DiffAugment",
  "full_name": "DiffAugment",
  "description": "**Differentiable Augmentation (DiffAugment)** is a set of differentiable image transformations used to augment data during [GAN](https://paperswithcode.com/method/gan) training. The transformations are applied to the real and generated images. It enables the gradients to be propagated through the augmentation back to the generator, regularizes\r\nthe discriminator without manipulating the target distribution, and maintains the balance of training\r\ndynamics. Three choices of transformation are preferred by the authors in their experiments: Translation, [CutOut](https://paperswithcode.com/method/cutout), and Color.",
  "title": "Differentiable Augmentation for Data-Efficient GAN Training",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "Coordinate attention",
  "full_name": "Coordinate attention",
  "description": "Hou et al. proposed coordinate attention,\r\na novel attention mechanism which\r\nembeds positional information into channel attention,\r\nso that the network can focus on large important regions \r\nat little computational cost.\r\n\r\nThe coordinate attention mechanism has two consecutive steps, coordinate information embedding and coordinate attention generation. First, two spatial extents of pooling kernels encode each channel horizontally  and  vertically. In the second step, a shared $1\\times 1$ convolutional transformation function is applied to the concatenated outputs of the two pooling layers. Then coordinate attention splits the resulting tensor into two separate tensors to yield attention vectors with the same number of channels for horizontal and vertical coordinates of the  input $X$ along. This can be written as \r\n\\begin{align}\r\n    z^h &= \\text{GAP}^h(X) \r\n\\end{align}\r\n\\begin{align}\r\n    z^w &= \\text{GAP}^w(X)\r\n\\end{align}\r\n\\begin{align}\r\n    f &= \\delta(\\text{BN}(\\text{Conv}_1^{1\\times 1}([z^h;z^w])))\r\n\\end{align}\r\n\\begin{align}\r\n    f^h, f^w &= \\text{Split}(f)\r\n\\end{align}\r\n\\begin{align}\r\n    s^h &= \\sigma(\\text{Conv}_h^{1\\times 1}(f^h))\r\n\\end{align}\r\n\\begin{align}\r\n    s^w &= \\sigma(\\text{Conv}_w^{1\\times 1}(f^w))\r\n\\end{align}\r\n\\begin{align}\r\n    Y &= X s^h  s^w\r\n\\end{align}\r\nwhere $\\text{GAP}^h$ and $\\text{GAP}^w$ denote pooling functions for vertical and horizontal coordinates, and $s^h \\in \\mathbb{R}^{C\\times 1\\times W}$ and $s^w \\in \\mathbb{R}^{C\\times H\\times 1}$ represent corresponding attention weights. 
\r\n\r\nUsing coordinate attention, the network can accurately obtain the position of a targeted object.\r\nThis approach has a larger receptive field than BAM and CBAM.\r\nLike an SE block, it also models cross-channel relationships, effectively enhancing the expressive power of the learned features.\r\nDue to its lightweight design and flexibility, \r\nit can be easily used in classical building blocks of mobile networks.",
  "title": "Coordinate Attention for Efficient Mobile Network Design",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ELECTRA",
  "full_name": "ELECTRA",
  "description": "**ELECTRA** is a [transformer](https://paperswithcode.com/method/transformer) with a new pre-training approach which trains two transformer models: the generator and the discriminator. The generator replaces tokens in the sequence - trained as a masked language model - and the discriminator (the ELECTRA contribution) attempts to identify which tokens are replaced by the generator in the sequence. This pre-training task is called replaced token detection, and is a replacement for masking the input.",
  "title": "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "RPM-Net",
  "full_name": "RPM-Net",
  "description": "**RPM-Net** is an end-to-end differentiable deep network for robust point matching uses learned features. It preserves robustness of RPM against noisy/outlier points while desensitizing initialization with point correspondences from learned feature distances instead of spatial distances. The network uses the differentiable Sinkhorn layer and annealing to get soft assignments of point correspondences from hybrid features learned from both spatial coordinates and local geometry. To further improve registration performance, the authors introduce a secondary network to predict optimal annealing parameters.",
  "title": "RPM-Net: Robust Point Matching using Learned Features",
  "collection": "Point Cloud Models",
  "area": "Computer Vision"
}
{
  "name": "Differential attention for visual question answering",
  "full_name": "Differential attention for visual question answering",
  "description": "In this paper we aim to answer questions based on images when provided with a dataset of question-answer pairs for a number of images during training. A number of methods have focused on solving this problem by using image based attention. This is done by focusing on a specific part of the image while answering the question. Humans also do so when solving this problem. However, the regions that the previous systems focus on are not correlated with the regions that humans focus on. The accuracy is limited due to this drawback. In this paper, we propose to solve this problem by using an exemplar based method. We obtain one or more supporting and opposing exemplars to obtain a differential attention region. This differential attention is closer to human attention than other image based attention methods. It also helps in obtaining improved accuracy when answering questions. The method is evaluated on challenging benchmark datasets. We perform better than other image based attention methods and are competitive with other state of the art methods that focus on both image and questions.",
  "title": null,
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "NNCF",
  "full_name": "Neural Network Compression Framework",
  "description": "**Neural Network Compression Framework**, or **NNCF**, is a Python-based framework for neural network compression with fine-tuning. It leverages recent advances of various network compression methods and implements some of them, namely quantization, sparsity, filter pruning and binarization. These methods allow producing more hardware-friendly models that can be efficiently run on general-purpose hardware computation units (CPU, GPU) or specialized deep learning accelerators.",
  "title": "Neural Network Compression Framework for fast model inference",
  "collection": "Model Compression",
  "area": "General"
}
{
  "name": "Step Decay",
  "full_name": "Step Decay",
  "description": "**Step Decay** is a learning rate schedule that drops the learning rate by a factor every few epochs, where the number of epochs is a hyperparameter.\r\n\r\nImage Credit: [Suki Lau](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)",
  "title": null,
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "RESCAL RP",
  "full_name": "RESCAL",
  "description": "",
  "title": "A Three-Way Model for Collective Learning on Multi-Relational Data",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "PermuteFormer",
  "full_name": "PermuteFormer",
  "description": "**PermuteFormer** is a [Performer](https://paperswithcode.com/method/performer)-based model with relative position encoding that scales linearly on long sequences. PermuteFormer applies position-dependent transformation on queries and keys to encode positional information into the attention module. This transformation is carefully crafted so that the final output of self-attention is not affected by absolute positions of tokens.\r\n\r\nEach token’s query / key feature is illustrated as a row of blocks in the figure, and its elements are marked with different colors. The position-aware permutation permutes elements of each token’s query / key feature along the head size dimension in each attention head. Depending on the token’s position, the permutation applied to query / key feature is different.",
  "title": "PermuteFormer: Efficient Relative Position Encoding for Long Sequences",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "XCiT",
  "full_name": "XCiT",
  "description": "**Cross-Covariance Image Transformers**, or **XCiT**, is a type of [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) that aims to combine the accuracy of [conventional transformers](https://paperswithcode.com/methods/category/transformers) with the scalability of [convolutional architectures](https://paperswithcode.com/methods/category/convolutional-neural-networks). \r\n\r\nThe [self-attention operation](https://paperswithcode.com/method/scaled) underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. The authors propose a “transposed” version of self-attention called [cross-covariance attention](https://paperswithcode.com/method/cross-covariance-attention) that operates across feature channels rather than tokens, where the interactions are based on the cross-covariances matrix between keys and queries.",
  "title": "XCiT: Cross-Covariance Image Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Proximity Regularization",
  "full_name": "Proximity Regularization",
  "description": "",
  "title": "Federated Optimization in Heterogeneous Networks",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "CodeGen",
  "full_name": "CodeGen",
  "description": "**CodeGen** is an autoregressive transformers with next-token prediction language modeling as the learning objective trained on a natural language corpus and programming language data curated from GitHub.",
  "title": "CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "SPEED",
  "full_name": "SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings",
  "description": "The monocular depth estimation (MDE) is the task of estimating depth from a single frame. This information is an essential knowledge in many computer vision tasks such as scene understanding and visual odometry, which are key components in autonomous and robotic systems. \r\nApproaches based on the state of the art vision transformer architectures are extremely deep and complex not suitable for real-time inference operations on edge and autonomous systems equipped with low resources (i.e. robot indoor navigation and surveillance). This paper presents SPEED, a Separable Pyramidal pooling EncodEr-Decoder architecture designed to achieve real-time frequency performances on multiple hardware platforms. The proposed model is a fast-throughput deep architecture for MDE able to obtain depth estimations with high accuracy from low resolution images using minimum hardware resources (i.e. edge devices). Our encoder-decoder model exploits two depthwise separable pyramidal pooling layers, which allow to increase the inference frequency while reducing the overall computational complexity. The proposed method performs better than other fast-throughput architectures in terms of both accuracy and frame rates, achieving real-time performances over cloud CPU, TPU and the NVIDIA Jetson TX1 on two indoor benchmarks: the NYU Depth v2 and the DIML Kinect v2 datasets.",
  "title": null,
  "collection": "Monocular Depth Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "DABMD",
  "full_name": "Distributed Any-Batch Mirror Descent",
  "description": "**Distributed Any-Batch Mirror Descent** (DABMD) is based on distributed Mirror Descent but uses a fixed per-round computing time to limit the waiting by fast nodes to receive information updates from slow nodes. DABMD is characterized by varying minibatch sizes across nodes. It is applicable to a broader range of problems compared with existing distributed online optimization methods such as those based on dual averaging, and it accommodates time-varying network topology.",
  "title": null,
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "1cycle",
  "full_name": "1cycle learning rate scheduling policy",
  "description": "",
  "title": "A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay",
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "Randomized Deletion",
  "full_name": "Randomized Deletion",
  "description": "",
  "title": null,
  "collection": "Robustness Methods",
  "area": "General"
}
{
  "name": "CSPResNeXt",
  "full_name": "CSPResNeXt",
  "description": "**CSPResNeXt** is a convolutional neural network where we apply the Cross Stage Partial Network (CSPNet) approach to [ResNeXt](https://paperswithcode.com/method/resnext). The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.",
  "title": "CSPNet: A New Backbone that can Enhance Learning Capability of CNN",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DELU",
  "full_name": "DELU",
  "description": "The **DELU** is a type of activation function that has trainable parameters, uses the complex linear and exponential functions in the positive dimension and uses the **[SiLU](https://paperswithcode.com/method/silu)** in the negative dimension.\r\n\r\n$$DELU(x) = SiLU(x), x \\leqslant 0$$\r\n$$DELU(x) = (n + 0.5)x + |e^{-x} - 1|, x > 0$$",
  "title": "Trainable Activations for Image Classification",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "DAEL",
  "full_name": "Domain Adaptive Ensemble Learning",
  "description": "**Domain Adaptive Ensemble Learning**, or **DAEL**, is an architecture for domain adaptation. The model is composed of a CNN feature extractor shared across domains and multiple classifier heads each trained to specialize in a particular source domain. Each such classifier is an expert to its own domain and a non-expert to others. DAEL aims to learn these experts collaboratively so that when forming an ensemble, they can leverage complementary information from each other to be more effective for an unseen target domain. To this end, each source domain is used in turn as a pseudo-target-domain with its own expert providing supervisory signal to the ensemble of non-experts learned from the other sources. For unlabeled target data under the UDA setting where real expert does not exist, DAEL uses pseudo-label to supervise the ensemble learning.",
  "title": "Domain Adaptive Ensemble Learning",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "Rational Activation Function",
  "full_name": "Rational Activation Function",
  "description": "Rational Activation Functions, ratio of polynomials as learnable functions",
  "title": "Padé Activation Units: End-to-end Learning of Flexible Activation Functions in Deep Networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "DLA",
  "full_name": "Deep Layer Aggregation",
  "description": "**DLA**, or **Deep Layer Aggregation**,  iteratively and hierarchically merges the feature hierarchy across layers in neural networks to make networks with better accuracy and fewer parameters. \r\n\r\nIn iterative deep aggregation (IDA), aggregation begins at the shallowest, smallest scale and then iteratively merges deeper,\r\nlarger scales. In this way shallow features are refined as\r\nthey are propagated through different stages of aggregation.\r\n\r\nIn hierarchical deep aggregation (HDA), blocks and stages\r\nin a tree are merged to preserve and combine feature channels. With\r\nHDA shallower and deeper layers are combined to learn\r\nricher combinations that span more of the feature hierarchy.\r\nWhile IDA effectively combines stages, it is insufficient\r\nfor fusing the many blocks of a network, as it is still only\r\nsequential.",
  "title": "Deep Layer Aggregation",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Pyramidal Bottleneck Residual Unit",
  "full_name": "Pyramidal Bottleneck Residual Unit",
  "description": "A **Pyramidal Bottleneck Residual Unit** is a type of residual unit where the number of channels gradually increases as a function of the depth at which the layer occurs, which is similar to a pyramid structure of which the shape gradually widens from the top downwards. It also consists of a bottleneck using 1x1 convolutions. It was introduced as part of the [PyramidNet](https://paperswithcode.com/method/pyramidnet) architecture.",
  "title": null,
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "mBERT",
  "full_name": "mBERT",
  "description": "mBERT",
  "title": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "GRIN",
  "full_name": "Graph Recurrent Imputation Network",
  "description": "",
  "title": null,
  "collection": "Bidirectional Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Variational Dropout",
  "full_name": "Variational Dropout",
  "description": "**Variational Dropout** is a regularization technique based on [dropout](https://paperswithcode.com/method/dropout), but uses a variational inference grounded approach. In Variational Dropout, we repeat the same dropout mask at each time step for both inputs, outputs, and recurrent layers (drop the same network units at each time step). This is in contrast to ordinary Dropout where different dropout masks are sampled at each time step for the inputs and outputs alone.",
  "title": "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "PAUSE",
  "full_name": "PAUSE",
  "description": "**PAUSE**, or **Positive and Annealed Unlabeled Sentence Embedding**, is an approach for learning sentence embeddings from a partially labeled dataset. It is based on a dual encoder schema that is widely adopted in supervised sentence embedding training. Each individual sample $\\mathbf{x}$ contains a pair of hypothesis and premise sentences $(x\\_{i},x^{\\prime}_{i})$, each of which is fed into a pretrained encoder (e.g. [BERT](https://paperswithcode.com/method/bert)). As shown in Figure, the two encoders are identical during the training by sharing their weights.",
  "title": "PAUSE: Positive and Annealed Unlabeled Sentence Embedding",
  "collection": "Sentence Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "FLAVR",
  "full_name": "FLAVR",
  "description": "**FLAVR** is an architecture for video frame interpolation. It uses 3D space-time convolutions to enable end-to-end learning and inference for video frame interpolation. Overall, it consists of a [U-Net](https://paperswithcode.com/method/u-net) style architecture with 3D space-time convolutions and\r\ndeconvolutions (yellow blocks). Channel gating is used after all (de-)[convolution](https://paperswithcode.com/method/convolution) layers (blue blocks). The final prediction layer (the purple block) is implemented as a convolution layer to project the 3D feature maps into $(k−1)$ frame predictions. This design allows FLAVR to predict multiple frames in one inference forward pass.",
  "title": "FLAVR: Flow-Agnostic Video Representations for Fast Frame Interpolation",
  "collection": "Video Interpolation Models",
  "area": "Computer Vision"
}
{
  "name": "BigBiGAN",
  "full_name": "BigBiGAN",
  "description": "**BigBiGAN** is a type of [BiGAN](https://paperswithcode.com/method/bigan) with a [BigGAN](https://paperswithcode.com/method/biggan) image generator. The authors initially used [ResNet](https://paperswithcode.com/method/resnet) as a baseline for the encoder $\\mathcal{E}$ followed by a 4-layer MLP with skip connections, but they experimented with RevNets and found they outperformed with increased network width, so opted for this type of encoder for the final architecture.",
  "title": "Large Scale Adversarial Representation Learning",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Lovasz-Softmax",
  "full_name": "Lovasz-Softmax",
  "description": "The **Lovasz-Softmax loss** is a loss function for multiclass semantic segmentation that incorporates the [softmax](https://paperswithcode.com/method/softmax) operation in the Lovasz extension. The Lovasz extension is a means by which we can achieve direct optimization of the mean intersection-over-union loss in neural networks.",
  "title": null,
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "SSE",
  "full_name": "Stochastic Steady-state Embedding",
  "description": "Stochastic Steady-state Embedding (SSE) is an algorithm that can learn many steady-state algorithms over graphs. Different from graph neural network family models, SSE is trained stochastically which only requires 1-hop information, but can capture fixed point relationships efficiently and effectively.\r\n\r\nDescription and Image from: [Learning Steady-States of Iterative Algorithms over Graphs](https://proceedings.mlr.press/v80/dai18a.html)",
  "title": "Learning Steady-States of Iterative Algorithms over Graphs",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "RE-NET",
  "full_name": "Recurrent Event Network",
  "description": "Recurrent Event Network (RE-NET) is an autoregressive architecture for predicting future interactions. The occurrence of a fact (event) is modeled as a probability distribution conditioned on temporal sequences of past knowledge graphs. RE-NET employs a recurrent event encoder to encode past facts and uses a neighborhood aggregator to model the connection of facts at the same timestamp. Future facts can then be inferred in a sequential manner based on the two modules.",
  "title": "Recurrent Event Network: Autoregressive Structure Inference over Temporal Knowledge Graphs",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Deformable ConvNets",
  "full_name": "Deformable Convolutional Networks",
  "description": "Deformable ConvNets do not learn an affine transformation. They divide convolution into two steps, firstly sampling features on a regular grid $ \\mathcal{R} $ from the input feature map, then aggregating sampled features by weighted summation using a convolution kernel. The process can be written as:\r\n\\begin{align}\r\n    Y(p_{0}) &= \\sum_{p_i \\in \\mathcal{R}} w(p_{i}) X(p_{0} + p_{i})\r\n\\end{align}\r\n\\begin{align}\r\n    \\mathcal{R}  &= \\{(-1,-1), (-1, 0), \\dots, (1, 1)\\}\r\n\\end{align}\r\nThe deformable convolution augments the sampling process by introducing a group of learnable offsets $\\Delta p_{i}$ which can be generated by a lightweight CNN. Using the offsets $\\Delta p_{i}$, the deformable convolution can be formulated as:\r\n\\begin{align}\r\n    Y(p_{0}) &= \\sum_{p_i \\in \\mathcal{R}} w(p_{i}) X(p_{0} + p_{i} + \\Delta p_{i}). \r\n\\end{align}\r\nThrough the above method, adaptive sampling is achieved.\r\nHowever, $\\Delta p_{i}$ is a floating point value\r\nunsuited to grid sampling. \r\nTo address this problem, bilinear interpolation is used. Deformable RoI pooling is also used, which greatly improves object detection. \r\n\r\nDeformable ConvNets adaptively select the important regions and enlarge the valid receptive field of convolutional neural networks; this is important in object detection and semantic segmentation tasks.",
  "title": "Deformable Convolutional Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ScaledSoftSign",
  "full_name": "ScaledSoftSign",
  "description": "The **ScaledSoftSign** is a modification of **[SoftSign](https://paperswithcode.com/method/softsign-activation)** activation function that has trainable parameters.\r\n\r\n$$ScaledSoftSign(x) = \\frac{\\alpha x}{\\beta + |x|}$$",
  "title": "Trainable Activations for Image Classification",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "LCC",
  "full_name": "Lipschitz Constant Constraint",
  "description": "Please enter a description about the method here",
  "title": "Regularisation of Neural Networks by Enforcing Lipschitz Continuity",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "SegSort",
  "full_name": "Segment Sorting",
  "description": "",
  "title": "SegSort: Segmentation by Discriminative Sorting of Segments",
  "collection": "Image Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "TDN",
  "full_name": "Temporaral Difference Network",
  "description": "**TDN**, or **Temporaral Difference Network**, is an action recognition model that aims to capture multi-scale temporal information. To fully capture temporal information over the entire video, the TDN is established with a two-level difference modeling paradigm. Specifically, for local motion modeling, temporal difference over consecutive frames is used to supply 2D CNNs with finer motion pattern, while for global motion modeling, temporal difference across segments is incorporated to capture long-range structure for motion feature excitation.",
  "title": "TDN: Temporal Difference Networks for Efficient Action Recognition",
  "collection": "Action Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "DropConnect",
  "full_name": "DropConnect",
  "description": "**DropConnect** generalizes [Dropout](https://paperswithcode.com/method/dropout) by randomly dropping the weights rather than the activations with probability $1-p$. DropConnect is similar to Dropout as it introduces dynamic sparsity within the model, but differs in that the sparsity is on the weights $W$, rather than the output vectors of a layer. In other words, the fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage. Note that this is not equivalent to setting $W$ to be a fixed sparse matrix during training.\r\n\r\nFor a DropConnect layer, the output is given as:\r\n\r\n$$ r = a \\left(\\left(M * W\\right){v}\\right)$$\r\n\r\nHere $r$ is the output of a layer, $v$ is the input to a layer, $W$ are weight parameters, and $M$ is a binary matrix encoding the connection information where $M\\_{ij} \\sim \\text{Bernoulli}\\left(p\\right)$. Each element of the mask $M$ is drawn independently for each example during training, essentially instantiating a different connectivity for each example seen. Additionally, the biases are also masked out during training.",
  "title": "Regularization of Neural Networks using DropConnect",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Asynchronous Interaction Aggregation",
  "full_name": "Asynchronous Interaction Aggregation",
  "description": "**Asynchronous Interaction Aggregation**, or **AIA**, is a network that leverages different interactions to boost action detection. There are two key designs in it: one is the Interaction Aggregation structure (IA) adopting a uniform paradigm to model and integrate multiple types of interaction; the other is the Asynchronous Memory Update algorithm (AMU) that enables us to achieve better performance by modeling very long-term interaction dynamically.",
  "title": "Asynchronous Interaction Aggregation for Action Detection",
  "collection": "Action Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "Fishr",
  "full_name": "Fishr",
  "description": "**Fishr** is a learning scheme to enforce domain invariance in the space of the gradients of the loss function: specifically, it introduces a regularization term that matches the domain-level variances of gradients across training domains. Critically, the strategy exhibits close relations with the Fisher Information and the Hessian of the loss. Forcing domain-level gradient covariances to be similar during the learning procedure eventually aligns the domain-level loss landscapes locally around the final weights.",
  "title": "Fishr: Invariant Gradient Variances for Out-of-Distribution Generalization",
  "collection": "Robustness Methods",
  "area": "General"
}
{
  "name": "HPO",
  "full_name": "Hyper-parameter optimization",
  "description": "In machine learning, a hyperparameter is a parameter whose value is used to control learning process, and HPO is the problem of choosing a set of optimal hyperparameters for a learning algorithm.",
  "title": "Algorithms for Hyper-Parameter Optimization",
  "collection": "AutoML",
  "area": "General"
}
{
  "name": "rnnDrop",
  "full_name": "rnnDrop",
  "description": "**rnnDrop** is a [dropout](https://paperswithcode.com/method/dropout) based regularization technique for [recurrent neural networks](https://paperswithcode.com/methods/category/recurrent-neural-networks). It amounts to using the same dropout mask at every timestep. It drops both the non-recurrent and recurrent connections. A simple figure to explain the idea is shown to the right. The figure shows an RNN being trained with rnnDrop for three frames $\\left(t-1, t, t+1\\right)$ on two different training sequences in the data (denoted as ‘sequence1’ and ‘sequence2’). The black circles denote the randomly omitted hidden nodes during training, and the dotted arrows stand for the model weights connected to those omitted nodes.\r\n\r\n*From: RnnDrop: A Novel Dropout for RNNs in ASR by Moon et al*",
  "title": null,
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "PANet",
  "full_name": "PANet",
  "description": "**Path Aggregation Network**, or **PANet**, aims to boost information flow in a proposal-based instance segmentation framework. Specifically, the feature hierarchy is enhanced with accurate localization signals in lower layers by [bottom-up path augmentation](https://paperswithcode.com/method/bottom-up-path-augmentation), which shortens the information path between lower layers and topmost feature. Additionally, [adaptive feature pooling](https://paperswithcode.com/method/adaptive-feature-pooling) is employed, which links feature grid and all feature levels to make useful information in each feature level propagate directly to following proposal subnetworks. A complementary branch capturing different views for each proposal is created to further improve mask prediction.",
  "title": "Path Aggregation Network for Instance Segmentation",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "CAM",
  "full_name": "Class-activation map",
  "description": "Class activation maps can be used to interpret the prediction decisions made by a convolutional neural network (CNN).\r\n\r\nImage source: [Learning Deep Features for Discriminative Localization](https://paperswithcode.com/paper/learning-deep-features-for-discriminative)",
  "title": "Is Object Localization for Free? - Weakly-Supervised Learning With Convolutional Neural Networks",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "ClariNet",
  "full_name": "ClariNet",
  "description": "**ClariNet** is an end-to-end text-to-speech architecture. Unlike previous TTS systems which use text-to-spectrogram models with a separate waveform [synthesizer](https://paperswithcode.com/method/synthesizer) (vocoder), ClariNet is a text-to-wave architecture that is fully convolutional and can be trained from scratch. In ClariNet, the [WaveNet](https://paperswithcode.com/method/wavenet) module is conditioned on the hidden states instead of the mel-spectrogram. The architecture is otherwise based on [Deep Voice 3](https://paperswithcode.com/method/deep-voice-3).",
  "title": "ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "SNIPER",
  "full_name": "SNIPER",
  "description": "**SNIPER** is a multi-scale training approach for instance-level recognition tasks like object detection and instance-level segmentation. Instead of processing all pixels in an image pyramid, SNIPER selectively processes context regions around the ground-truth objects (a.k.a. chips). This can help to speed up multi-scale training as it operates on low-resolution chips. Due to its memory-efficient design, SNIPER can benefit from [Batch Normalization](https://paperswithcode.com/method/batch-normalization) during training, and it makes larger batch sizes possible for instance-level recognition tasks on a single GPU.",
  "title": "SNIPER: Efficient Multi-Scale Training",
  "collection": "Multi-Scale Training",
  "area": "Computer Vision"
}
{
  "name": "Invertible 1x1 Convolution",
  "full_name": "Invertible 1x1 Convolution",
  "description": "The **Invertible 1x1 Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) used in flow-based generative models as a learned generalization of the fixed permutation that reverses the ordering of channels. The weight matrix is initialized as a random rotation matrix. The log-determinant of an invertible 1 × 1 convolution of a $h \\times w \\times c$ tensor $\\mathbf{h}$ with $c \\times c$ weight matrix $\\mathbf{W}$ is straightforward to compute:\r\n\r\n$$ \\log | \\text{det}\\left(\\frac{d\\text{conv2D}\\left(\\mathbf{h};\\mathbf{W}\\right)}{d\\mathbf{h}}\\right) | = h \\cdot w \\cdot \\log | \\text{det}\\left(\\mathbf{W}\\right) | $$",
  "title": "Glow: Generative Flow with Invertible 1x1 Convolutions",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "CRF-RNN",
  "full_name": "CRF-RNN",
  "description": "**CRF-RNN** is a formulation of a [CRF](https://paperswithcode.com/method/crf) as a Recurrent Neural Network. Specifically it formulates mean-field approximate inference for the Conditional Random Fields with Gaussian pairwise potentials as Recurrent Neural Networks.",
  "title": "Conditional Random Fields as Recurrent Neural Networks",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Tree-structured Parzen Estimator Approach (TPE)",
  "full_name": "Tree-structured Parzen Estimator Approach (TPE)",
  "description": "",
  "title": "Hyperopt: A Python Library for Optimizing the Hyperparameters of Machine Learning Algorithms",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "VEGA",
  "full_name": "VEGA",
  "description": "**VEGA** is an AutoML framework that is compatible and optimized for multiple hardware platforms. It integrates various modules of AutoML, including [Neural Architecture Search](https://paperswithcode.com/method/neural-architecture-search) (NAS), Hyperparameter Optimization (HPO), Auto Data Augmentation, Model Compression, and Fully Train. To support a variety of search algorithms and tasks, it involves a fine-grained search space and a description language to enable easy adaptation to different search algorithms and tasks.",
  "title": "VEGA: Towards an End-to-End Configurable AutoML Pipeline",
  "collection": "AutoML",
  "area": "General"
}
{
  "name": "ParamCrop",
  "full_name": "ParamCrop",
  "description": "**ParamCrop** is a parametric cubic cropping for video contrastive learning, where cubic cropping refers to cropping a 3D cube from the input video. The central component of ParamCrop is a differentiable spatio-temporal cropping operation, which enables ParamCrop to be trained simultaneously with the video backbone and adjust the cropping strategy on the fly. The objective of ParamCrop is set to be adversarial to the video backbone, which is to increase the contrastive loss. Hence, initialized with the simplest setting where two cropped views largely overlap, ParamCrop gradually increases the disparity between two views.",
  "title": "ParamCrop: Parametric Cubic Cropping for Video Contrastive Learning",
  "collection": "Generative Video Models",
  "area": "Computer Vision"
}
{
  "name": "RIFE",
  "full_name": "RIFE",
  "description": "**RIFE**, or **Real-time Intermediate Flow Estimation**, is an intermediate flow estimation algorithm for Video Frame Interpolation (VFI). Many recent flow-based VFI methods first estimate the bi-directional optical flows, then scale and reverse them to approximate intermediate flows, leading to artifacts on motion boundaries. RIFE uses a neural network named [IFNet](https://paperswithcode.com/method/ifnet) that can directly estimate the intermediate flows from coarse-to-fine with much better speed. It introduces a privileged distillation scheme for training the intermediate flow model, which leads to a large performance improvement.\r\n\r\nIn RIFE training, given two input frames $I_{0}, I_{1}$, we directly feed them into the IFNet to approximate intermediate flows $F_{t \\rightarrow 0}, F_{t \\rightarrow 1}$ and the fusion map $M$. During the training phase, a privileged teacher refines the student's results to get $F_{t \\rightarrow 0}^{\\text{Tea}}, F_{t \\rightarrow 1}^{\\text{Tea}}$ and $M^{\\text{Tea}}$ based on ground truth $I_{t}$. The student model and the teacher model are jointly trained from scratch using the reconstruction loss. The teacher's approximations are more accurate so that they can guide the student to learn.",
  "title": "RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation",
  "collection": "Video Frame Interpolation",
  "area": "Computer Vision"
}
{
  "name": "SSKD",
  "full_name": "Semi-Supervised Knowledge Distillation",
  "description": "**Semi-Supervised Knowledge Distillation** is a type of knowledge distillation for person re-identification that exploits weakly annotated data by assigning soft pseudo labels to YouTube-Human to improve models' generalization ability. SSKD first trains a student model (e.g. [ResNet](https://paperswithcode.com/method/resnet)-50) and a teacher model (e.g. ResNet-101) using labeled data from multi-source domain datasets. Then, SSKD develops an [auxiliary classifier](https://paperswithcode.com/method/auxiliary-classifier) to imitate the soft predictions of unlabeled data generated by the teacher model. Meanwhile, the student model is also supervised by hard labels and predicted soft labels by the teacher model for labeled data.",
  "title": "Semi-Supervised Domain Generalizable Person Re-Identification",
  "collection": "Knowledge Distillation",
  "area": "General"
}
{
  "name": "Disentangled Attention Mechanism",
  "full_name": "Disentangled Attention Mechanism",
  "description": "**Disentangled Attention Mechanism** is an attention mechanism used in the [DeBERTa](https://paperswithcode.com/method/deberta) architecture. Unlike [BERT](https://paperswithcode.com/method/bert) where each word in the input layer is represented using a vector which is the sum of its word (content) embedding and position embedding, each word in DeBERTa is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices based on their contents and relative positions, respectively. This is motivated by the observation that the attention weight of a word pair depends on not only their contents but their relative positions. For example, the dependency between the words “deep” and “learning” is much stronger when they occur next to each other than when they occur in different sentences.",
  "title": "DeBERTa: Decoding-enhanced BERT with Disentangled Attention",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Res2Net",
  "full_name": "Res2Net",
  "description": "**Res2Net** is an image model that employs a variation on bottleneck residual blocks. The motivation is to be able to represent features at multiple scales. This is achieved through a novel building block for CNNs that constructs hierarchical residual-like connections within one single [residual block](https://paperswithcode.com/method/residual-block).\r\nThis represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.",
  "title": "Res2Net: A New Multi-scale Backbone Architecture",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "HetPipe",
  "full_name": "HetPipe",
  "description": "**HetPipe** is a hybrid parallel method that integrates pipelined model parallelism (PMP) with data parallelism (DP). In HetPipe, a group of multiple GPUs, called a virtual worker, processes minibatches in a pipelined manner, and multiple such virtual workers employ data parallelism for higher performance.",
  "title": null,
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "TernaryBERT",
  "full_name": "TernaryBERT",
  "description": "**TernaryBERT** is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based model which ternarizes the weights of a pretrained [BERT](https://paperswithcode.com/method/bert) model to $\\{-1,0,+1\\}$, with different granularities for word embedding and weights in the Transformer layer. Instead of directly using knowledge distillation to compress a model, it is used to improve the performance of the ternarized student model with the same size as the teacher model. In this way, we transfer the knowledge from the highly-accurate teacher model to the ternarized student model with smaller capacity.",
  "title": "TernaryBERT: Distillation-aware Ultra-low Bit BERT",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "PLATO-2",
  "full_name": "PLATO-2",
  "description": "",
  "title": "PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Feature-Centric Voting",
  "full_name": "Feature-Centric Voting",
  "description": "",
  "title": "Voting for Voting in Online Point Cloud Object Detection",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "NormFormer",
  "full_name": "NormFormer",
  "description": "**NormFormer** is a type of [Pre-LN](https://paperswithcode.com/method/layer-normalization) transformer that adds three normalization operations to each layer: a Layer Norm after self attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first [fully connected layer](https://paperswithcode.com/method/position-wise-feed-forward-layer). The modifications introduce a small number of additional learnable parameters, which provide a cost-effective way for each layer to change the magnitude of its features, and therefore the magnitude of the gradients to subsequent components.",
  "title": "NormFormer: Improved Transformer Pretraining with Extra Normalization",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Graph2Tree",
  "full_name": "Graph-to-Tree MWP Solver",
  "description": "",
  "title": "Graph-to-Tree Learning for Solving Math Word Problems",
  "collection": "Sequence To Sequence Models",
  "area": "Sequential"
}
{
  "name": "ParaNet Convolution Block",
  "full_name": "ParaNet Convolution Block",
  "description": "A **ParaNet Convolution Block** is a convolutional block that appears in the encoder and decoder of the [ParaNet](https://paperswithcode.com/method/paranet) text-to-speech architecture. It consists of a 1-D [convolution](https://paperswithcode.com/method/convolution) with a gated linear unit ([GLU](https://paperswithcode.com/method/glu)) and a [residual connection](https://paperswithcode.com/method/residual-connection). It is similar to the [DV3 Convolution Block](https://paperswithcode.com/method/dv3-convolution-block).",
  "title": "Non-Autoregressive Neural Text-to-Speech",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "MODNet",
  "full_name": "MODNet",
  "description": "**MODNet** is a light-weight matting objective decomposition network that can process portrait matting from a single input image in real time. The design of MODNet benefits from optimizing a series of correlated sub-objectives simultaneously via explicit constraints. To overcome the domain shift problem, MODNet introduces a self-supervised strategy based on sub-objective consistency (SOC) and a one-frame delay trick to smooth the results when applying MODNet to portrait video sequences.\r\n\r\nGiven an input image $I$, MODNet predicts human semantics $s\\_{p}$, boundary details $d\\_{p}$, and final alpha matte $\\alpha\\_{p}$ through three interdependent branches, $S, D$, and $F$, which are constrained by specific supervisions generated from the ground truth matte $\\alpha\\_{g}$. Since the decomposed sub-objectives are correlated and help strengthen each other, we can optimize MODNet end-to-end.",
  "title": "MODNet: Real-Time Trimap-Free Portrait Matting via Objective Decomposition",
  "collection": "Portrait Matting Models",
  "area": "Computer Vision"
}
{
  "name": "Conditional Positional Encoding",
  "full_name": "Conditional Positional Encoding",
  "description": "**Conditional Positional Encoding**, or **CPE**, is a type of positional encoding for [vision transformers](https://paperswithcode.com/methods/category/vision-transformer). Unlike previous fixed or learnable positional encodings, which are predefined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE aims to generalize to the input sequences that are longer than what the model has ever seen during training. CPE can also keep the desired translation-invariance in the image classification task. CPE can be implemented with a [Position Encoding Generator](https://paperswithcode.com/method/positional-encoding-generator) (PEG) and incorporated into the current [Transformer framework](https://paperswithcode.com/methods/category/transformers).",
  "title": "Conditional Positional Encodings for Vision Transformers",
  "collection": "Position Embeddings",
  "area": "General"
}
{
  "name": "DNN2LR",
  "full_name": "DNN2LR",
  "description": "**DNN2LR** is an automatic feature crossing method that finds feature interactions in a deep neural network and uses them as cross features in logistic regression. In general, DNN2LR consists of two steps: (1) generating a compact and accurate candidate set of cross feature fields; (2) searching in the candidate set for the final cross feature fields.",
  "title": "DNN2LR: Interpretation-inspired Feature Crossing for Real-world Tabular Data",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "ColorJitter",
  "full_name": "Color Jitter",
  "description": "**ColorJitter** is a type of image data augmentation where we randomly change the brightness, contrast and saturation of an image.\r\n\r\nImage Credit: [Apache MXNet](https://mxnet.apache.org/versions/1.5.0/tutorials/gluon/data_augmentation.html)",
  "title": null,
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Multi-Query Attention",
  "full_name": "Multi-Query Attention",
  "description": "Multi-head attention consists of multiple attention layers (heads) in parallel with different linear transformations on the queries, keys, values and outputs. **Multi-query attention** is identical except that the different heads share a single set of keys and values.",
  "title": "Fast Transformer Decoding: One Write-Head is All You Need",
  "collection": "Attention",
  "area": "General"
}
{
  "name": "WFST",
  "full_name": "weighted finite state transducer",
  "description": "",
  "title": "NeMo Inverse Text Normalization: From Development To Production",
  "collection": "Rule-based systems",
  "area": "General"
}
{
  "name": "StarReLU",
  "full_name": "StarReLU",
  "description": "**StarReLU** is an activation function defined as:\r\n\r\n$$\\mathrm{StarReLU}(x) = s \\cdot (\\mathrm{ReLU}(x))^2 + b$$\r\n\r\nwhere $s \\in \\mathbb{R}$ and $b \\in \\mathbb{R}$ are shared for all channels and can be set as constants ($s = 0.8944$, $b = -0.4472$) or learnable parameters.",
  "title": "MetaFormer Baselines for Vision",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Cutout",
  "full_name": "Cutout",
  "description": "**Cutout** is an image augmentation and regularization technique that randomly masks out square regions of input during training, and can be used to improve the robustness and overall performance of convolutional neural networks. The main motivation for cutout comes from the problem of object occlusion, which is commonly encountered in many computer vision tasks, such as object recognition, tracking, or human pose estimation. By generating new images which simulate occluded examples, we not only better prepare the model for encounters with occlusions in the real world, but the model also learns to take more of the image context into consideration when making decisions.",
  "title": "Improved Regularization of Convolutional Neural Networks with Cutout",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "ELU",
  "full_name": "Exponential Linear Unit",
  "description": "The **Exponential Linear Unit** (ELU) is an activation function for neural networks. In contrast to [ReLUs](https://paperswithcode.com/method/relu), ELUs have negative values which allows them to push mean unit activations closer to zero like [batch normalization](https://paperswithcode.com/method/batch-normalization) but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. While [LReLUs](https://paperswithcode.com/method/leaky-relu) and [PReLUs](https://paperswithcode.com/method/prelu) have negative values, too, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and thereby decrease the forward propagated variation and information.\r\n\r\nThe exponential linear unit (ELU) with $0 < \\alpha$ is:\r\n\r\n$$f\\left(x\\right) = \\begin{cases} x & \\text{if } x > 0 \\\\\\\\ \\alpha\\left(\\exp\\left(x\\right) - 1\\right) & \\text{if } x \\leq 0 \\end{cases}$$",
  "title": "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Hourglass Module",
  "full_name": "Hourglass Module",
  "description": "An **Hourglass Module** is an image block module used mainly for pose estimation tasks. The design of the hourglass is motivated by the need to capture information at every scale. While local evidence is essential for identifying features like faces and hands, a final pose estimate requires a coherent understanding of the full body. The person’s orientation, the arrangement of their limbs, and the relationships of adjacent joints are among the many cues that are best recognized at different scales in the image. The hourglass is a simple, minimal design that has the capacity to capture all of these features and bring them together to output pixel-wise predictions.\r\n\r\nThe network must have some mechanism to effectively process and consolidate features across scales. The Hourglass uses a single pipeline with skip layers to preserve spatial information at each resolution. The network reaches its lowest resolution at 4x4 pixels allowing smaller spatial filters to be applied that compare features across the entire space of the image.\r\n\r\nThe hourglass is set up as follows: Convolutional and [max pooling](https://paperswithcode.com/method/max-pooling) layers are used to process features down to a very low resolution. At each max pooling step, the network branches off and applies more convolutions at the original pre-pooled resolution. After reaching the lowest resolution, the network begins the top-down sequence of upsampling and combination of features across scales. To bring together information across two adjacent resolutions, we do nearest neighbor upsampling of the lower resolution followed by an elementwise addition of the two sets of features. The topology of the hourglass is symmetric, so for every layer present on the way down there is a corresponding layer going up.\r\n\r\nAfter reaching the output resolution of the network, two consecutive rounds of 1x1 convolutions are applied to produce the final network predictions. The output of the network is a set of heatmaps where for a given [heatmap](https://paperswithcode.com/method/heatmap) the network predicts the probability of a joint’s presence at each and every pixel.",
  "title": "Stacked Hourglass Networks for Human Pose Estimation",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Accordion",
  "full_name": "Accordion",
  "description": "**Accordion** is a gradient communication scheduling algorithm that is generic across models while imposing low computational overheads. Accordion inspects the change in the gradient norms to detect critical regimes and adjusts the communication schedule dynamically. Accordion works for adjusting either the gradient compression rate or the batch size without additional parameter tuning.",
  "title": "Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "DeepZero",
  "full_name": "DeepZero",
  "description": "",
  "title": "DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "FASFA",
  "full_name": "FASFA: A Novel Next-Generation Backpropagation Optimizer",
  "description": "This paper introduces the fast adaptive stochastic function accelerator (FASFA) for gradient-based optimization of stochastic objective functions. It works based on Nesterov-enhanced first and second momentum estimates. The method is simple and effective during implementation because it has intuitive/familiar hyperparameterization. The training dynamics can be progressive or conservative depending on the decay rate sum. It works well with a low learning rate and mini batch size. Experiments and statistics showed convincing evidence that FASFA could be an ideal candidate for optimizing stochastic objective functions, particularly those generated by multilayer perceptrons with convolution and dropout layers. In addition, the convergence properties and regret bound provide results aligning with the online convex optimization framework. In a first of its kind, FASFA addresses the growing need for diverse optimizers by providing next-generation training dynamics for artificial intelligence algorithms. Future experiments could modify FASFA based on the infinity norm.",
  "title": null,
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "OTM",
  "full_name": "Optimal Transport Modeling",
  "description": "",
  "title": "Generative Modeling with Optimal Transport Maps",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "MNMF",
  "full_name": "Modularity preserving NMF",
  "description": "",
  "title": "Community Preserving Network Embedding",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "PixLoc",
  "full_name": "PixLoc",
  "description": "**PixLoc** is a scene-agnostic neural network that estimates an accurate 6-DoF pose from an image and a 3D model. It is based on the direct alignment of multiscale deep features, casting camera localization as metric learning. PixLoc learns strong data priors by end-to-end training from pixels to pose and exhibits exceptional generalization to new scenes by separating model parameters and scene geometry. As the CNN never sees 3D points, PixLoc can generalize to any 3D structure available. This includes sparse SfM point clouds, dense depth maps from stereo or RGBD sensors, meshes, Lidar scans, but also lines and other primitives.",
  "title": "Back to the Feature: Learning Robust Camera Localization from Pixels to Pose",
  "collection": "6D Pose Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "Cosine Normalization",
  "full_name": "Cosine Normalization",
  "description": "Multi-layer neural networks traditionally use dot products between the output vector of the previous layer and the incoming weight vector as the input to the activation function. The result of the dot product is unbounded. To bound the dot product and decrease the variance, **Cosine Normalization** uses cosine similarity or centered cosine similarity (Pearson Correlation Coefficient) instead of dot products in neural networks.\r\n\r\nUsing cosine normalization, the output of a hidden unit is computed by:\r\n\r\n$$o = f(net_{norm})= f(\\cos \\theta) = f(\\frac{\\vec{w} \\cdot \\vec{x}} {\\left|\\vec{w}\\right|  \\left|\\vec{x}\\right|})$$\r\n\r\nwhere $net_{norm}$ is the normalized pre-activation, $\\vec{w}$ is the incoming weight vector, $\\vec{x}$ is the input vector, ($\\cdot$) indicates the dot product, and $f$ is a nonlinear activation function. Cosine normalization bounds the pre-activation between -1 and 1.",
  "title": "Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "PReLU-Net",
  "full_name": "PReLU-Net",
  "description": "**PReLU-Net** is a type of convolutional neural network that utilises parameterized ReLUs for its activation function. It also uses a robust initialization scheme - afterwards known as [Kaiming Initialization](https://paperswithcode.com/method/he-initialization) - that accounts for non-linear activation functions.",
  "title": "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Random Gaussian Blur",
  "full_name": "Random Gaussian Blur",
  "description": "**Random Gaussian Blur** is an image data augmentation technique where we randomly blur the image using a Gaussian distribution.\r\n\r\nImage Source: [Wikipedia](https://en.wikipedia.org/wiki/Gaussian_blur)",
  "title": null,
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Concatenated Skip Connection",
  "full_name": "Concatenated Skip Connection",
  "description": "A **Concatenated Skip Connection** is a type of skip connection that seeks to reuse features by concatenating them to new layers, allowing more information to be retained from previous layers of the network. This contrasts with, say, residual connections, where element-wise summation is used instead to incorporate information from previous layers. This type of skip connection is prominently used in DenseNets (and also Inception networks), which the Figure to the right illustrates.",
  "title": null,
  "collection": "Skip Connections",
  "area": "General"
}
{
  "name": "KE-MLM",
  "full_name": "Knowledge Enhanced Masked Language Model",
  "description": "",
  "title": "Knowledge Enhanced Masked Language Model for Stance Detection",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "GA",
  "full_name": "Genetic Algorithms",
  "description": "Genetic Algorithms are search algorithms that mimic Darwinian biological evolution in order to select and propagate better solutions.",
  "title": "Genetic Algorithms and the Traveling Salesman Problem a historical Review",
  "collection": "Heuristic Search Algorithms",
  "area": "Reinforcement Learning"
}
{
  "name": "TorchBeast",
  "full_name": "TorchBeast",
  "description": "**TorchBeast** is a platform for reinforcement learning (RL) research in PyTorch. It implements a version of the popular [IMPALA](https://paperswithcode.com/method/impala) algorithm for fast, asynchronous, parallel training of RL agents.",
  "title": "TorchBeast: A PyTorch Platform for Distributed RL",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "EdgeBoxes",
  "full_name": "EdgeBoxes",
  "description": "**EdgeBoxes** is an approach for generating object bounding box proposals directly from edges. Similar to segments, edges provide a simplified but informative representation of an image. In fact, line drawings of an image can accurately convey the high-level information contained in an image using only a small fraction of the information.\r\n\r\nThe main insight behind the method is the observation: the number of contours wholly enclosed by a bounding box is indicative of the likelihood of the box containing an object. We say a contour is wholly enclosed by a box if all edge pixels belonging to the contour lie within the interior of the box. Edges tend to correspond to object boundaries, and as such boxes that tightly enclose a set of edges are likely to contain an object. However, some edges that lie within an object’s bounding box may not be part of the contained object. Specifically, edge pixels that belong to contours straddling the box’s boundaries are likely to correspond to objects or structures that lie outside the box.\r\n\r\nSource: [Zitnick and Dollar](https://pdollar.github.io/files/papers/ZitnickDollarECCV14edgeBoxes.pdf)",
  "title": null,
  "collection": "Region Proposal",
  "area": "Computer Vision"
}
{
  "name": "DOLG",
  "full_name": "Deep Orthogonal Fusion of Local and Global Features",
  "description": "Image Retrieval is a fundamental task of obtaining images similar to the query one from a database. A common image retrieval practice is to first retrieve candidate images via similarity search using global image features and then re-rank the candidates by leveraging their local features. Previous learning-based studies mainly focus on either global or local image representation learning to tackle the retrieval task. In this paper, we abandon the two-stage paradigm and seek to design an effective single-stage solution by integrating local and global information inside images into compact image representations. Specifically, we propose a Deep Orthogonal Local and Global (DOLG) information fusion framework for end-to-end image retrieval. It attentively extracts representative local information with multi-atrous convolutions and self-attention at first. Components orthogonal to the global image representation are then extracted from the local information. At last, the orthogonal components are concatenated with the global representation as a complement, and then aggregation is performed to generate the final representation. The whole framework is end-to-end differentiable and can be trained with image-level labels. Extensive experimental results validate the effectiveness of our solution and show that our model achieves state-of-the-art image retrieval performance on the Revisited Oxford and Paris datasets.",
  "title": "DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features",
  "collection": "Image Retrieval Models",
  "area": "Computer Vision"
}
{
  "name": "SSD",
  "full_name": "SSD",
  "description": "**SSD** is a single-stage object detection method that discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. \r\n\r\nThe fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. Improvements over competing single-stage methods include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.",
  "title": "SSD: Single Shot MultiBox Detector",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Hierarchical Softmax",
  "full_name": "Hierarchical Softmax",
  "description": "**Hierarchical Softmax** is a is an alternative to [softmax](https://paperswithcode.com/method/softmax) that is faster to evaluate: it is $O\\left(\\log{n}\\right)$ time to evaluate compared to $O\\left(n\\right)$ for softmax. It utilises a multi-layer binary tree, where the probability of a word is calculated through the product of probabilities on each edge on the path to that node. See the Figure to the right for an example of where the product calculation would occur for the word \"I'm\".\r\n\r\n(Introduced by Morin and Bengio)\r\n\r\nImage Credit: [Steven Schmatz](https://www.quora.com/profile/Steven-Schmatz)",
  "title": null,
  "collection": "Output Functions",
  "area": "General"
}
{
  "name": "Mirror-BERT",
  "full_name": "Mirror-BERT",
  "description": "Mirror-BERT converts pretrained language models into effective universal text encoders without any supervision, in 20-30 seconds. It is an extremely simple, fast, and effective contrastive learning technique. It relies on fully identical *or* slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples, and aims to maximise their similarity during identity fine-tuning.",
  "title": "Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "DAC",
  "full_name": "Dynamic Algorithm Configuration",
  "description": "Dynamic algorithm configuration (DAC) is capable of generalizing over prior optimization approaches, as well as handling optimization of hyperparameters that need to be adjusted over multiple time-steps.\r\n\r\nImage Source: [Biedenkapp et al.](http://ecai2020.eu/papers/1237_paper.pdf)",
  "title": "Dynamic Algorithm Configuration: Foundation of a New Meta-Algorithmic Framework",
  "collection": "Hyperparameter Search",
  "area": "General"
}
{
  "name": "ResNeXt",
  "full_name": "ResNeXt",
  "description": "A **ResNeXt** repeats a building block that aggregates a set of transformations with the same topology. Compared to a [ResNet](https://paperswithcode.com/method/resnet), it exposes a new dimension,  *cardinality* (the size of the set of transformations) $C$, as an essential factor in addition to the dimensions of depth and width. \r\n\r\nFormally, a set of aggregated transformations can be represented as: $\\mathcal{F}(x)=\\sum_{i=1}^{C}\\mathcal{T}_i(x)$, where $\\mathcal{T}_i(x)$ can be an arbitrary function. Analogous to a simple neuron, $\\mathcal{T}_i$ should project $x$ into an (optionally low-dimensional) embedding and then transform it.",
  "title": "Aggregated Residual Transformations for Deep Neural Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Discriminative Adversarial Search",
  "full_name": "Discriminative Adversarial Search",
  "description": "**Discriminative Adversarial Search**, or **DAS**, is a sequence decoding approach which aims to alleviate the effects of exposure bias and to optimize on the data distribution itself rather than for external metrics. Inspired by generative adversarial networks (GANs), wherein a discriminator is used to improve the generator, DAS differs from GANs in that the generator parameters are not updated at training time and the discriminator is only used to drive sequence generation at inference time.",
  "title": "Discriminative Adversarial Search for Abstractive Summarization",
  "collection": "Sequence Decoding Methods",
  "area": "Natural Language Processing"
}
{
  "name": "GhostNet",
  "full_name": "GhostNet",
  "description": "A **GhostNet** is a type of convolutional neural network that is built using Ghost modules, which aim to generate more features by using fewer parameters (allowing for greater efficiency). \r\n\r\nGhostNet mainly consists of a stack of Ghost bottlenecks with the Ghost modules as the building block. The first layer is a standard convolutional layer with 16 filters, then a series of Ghost bottlenecks with gradually increased channels follow. These Ghost bottlenecks are grouped into different stages according to the sizes of their input feature maps. All the Ghost bottlenecks are applied with stride=1 except that the last one in each stage is with stride=2. At last a [global average pooling](https://paperswithcode.com/method/global-average-pooling) and a convolutional layer are utilized to transform the feature maps to a 1280-dimensional feature vector for final classification. The squeeze and excite (SE) module is also applied to the residual layer in some ghost bottlenecks. \r\n\r\nIn contrast to [MobileNetV3](https://paperswithcode.com/method/mobilenetv3), GhostNet does not use [hard-swish](https://paperswithcode.com/method/hard-swish) nonlinearity function due to its large latency.",
  "title": "GhostNet: More Features from Cheap Operations",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "MoViNet",
  "full_name": "MoViNet",
  "description": "**Mobile Video Network**, or **MoViNet**, is a type of computation and memory efficient video network that can operate on streaming video for online inference. Three techniques are used to improve efficiency while reducing the peak memory usage of 3D CNNs. First, a video network search space is designed and [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) employed to generate efficient and diverse 3D CNN architectures. Second, a Stream Buffer technique is introduced that decouples memory from video clip duration, allowing 3D CNNs to embed arbitrary-length streaming video sequences for both training and inference with a small constant memory footprint. Third, a simple ensembling technique is used to improve accuracy further without sacrificing efficiency.",
  "title": "MoViNets: Mobile Video Networks for Efficient Video Recognition",
  "collection": "Video Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "Position-Wise Feed-Forward Layer",
  "full_name": "Position-Wise Feed-Forward Layer",
  "description": "**Position-Wise Feed-Forward Layer** is a type of [feedforward layer](https://www.paperswithcode.com/method/category/feedforwad-networks) consisting of two [dense layers](https://www.paperswithcode.com/method/dense-connections) that applies to the last dimension, which means the same dense layers are used for each position item in the sequence, so called position-wise.",
  "title": "Attention Is All You Need",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "CrossViT",
  "full_name": "CrossViT",
  "description": "**CrossViT** is a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) that uses a dual-branch architecture to extract multi-scale feature representations for image classification. The architecture combines image patches (i.e. tokens in a [transformer](https://paperswithcode.com/method/transformer)) of different sizes to produce stronger visual features for image classification. It processes small and large patch tokens with two separate branches of different computational complexities and these tokens are fused together multiple times to complement each other.\r\n\r\nFusion is achieved by an efficient [cross-attention module](https://paperswithcode.com/method/cross-attention-module), in which each transformer branch creates a non-patch token as an agent to exchange information with the other branch by attention. This allows for linear-time generation of the attention map in fusion instead of quadratic time otherwise.",
  "title": "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Cycle-CenterNet",
  "full_name": "Cycle-CenterNet",
  "description": "**Cycle-CenterNet** is a table structure parsing approach built on [CenterNet](https://paperswithcode.com/method/centernet) that uses a cycle-pairing module to simultaneously detect and group tabular cells into structured tables. It also utilizes a pairing loss which enables the grouping of discrete cells into the structured tables.",
  "title": "Parsing Table Structures in the Wild",
  "collection": "Table Parsing Models",
  "area": "General"
}
{
  "name": "GreedyNAS",
  "full_name": "GreedyNAS",
  "description": "**GreedyNAS** is a one-shot [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. Previous methods held the assumption that a supernet should give a reasonable ranking over all paths. They thus treat all paths equally, and spare much effort to train paths. However, it is harsh for a single supernet to evaluate accurately on such a huge-scale search space (eg, $7^{21}$). GreedyNAS eases the burden of supernet by encouraging focus more on evaluation of potentially-good candidates, which are identified using a surrogate portion of validation data. \r\n\r\nConcretely, during training, GreedyNAS utilizes a multi-path sampling strategy with rejection, and greedily filters the weak paths. The training efficiency is thus boosted since the training space has been greedily shrunk from all paths to those potentially-good ones. An exploration and exploitation policy is adopted by introducing an empirical candidate path pool.",
  "title": "GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "VDO-SLAM",
  "full_name": "VDO-SLAM",
  "description": "**VDO-SLAM** is a feature-based stereo/RGB-D dynamic SLAM system that leverages image-based semantic information to simultaneously localise the robot, map the static and dynamic structure, and track motions of rigid objects in the scene. Input images are first pre-processed to generate instance-level object segmentation and dense optical flow. These are then used to track features on static background structure and dynamic objects. Camera poses and object motions estimated from feature tracks are then refined in a global batch optimisation, and a local map is maintained and updated with every new frame.",
  "title": "VDO-SLAM: A Visual Dynamic Object-aware SLAM System",
  "collection": "SLAM Methods",
  "area": "General"
}
{
  "name": "Temporal attention",
  "full_name": "Temporal attention",
  "description": "Temporal attention can be seen as a dynamic time selection mechanism determining when to pay attention, and is thus usually used for video processing.",
  "title": "Jointly Attentive Spatial-Temporal Pooling Networks for Video-based Person Re-Identification",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "SETR",
  "full_name": "Segmentation Transformer",
  "description": "**Segmentation Transformer**, or **SETR**, is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based segmentation model. The transformer-alone encoder treats an input image as a sequence of image patches represented by learned patch embedding, and transforms the sequence with global self-attention modeling for discriminative feature representation learning. Concretely, we first decompose an image into a grid of fixed-sized patches, forming a sequence of patches. With a linear embedding layer applied to the flattened pixel vectors of every patch, we then obtain a sequence of feature embedding vectors as the input to a transformer. Given the learned features from the encoder\r\ntransformer, a decoder is then used to recover the original image resolution. Crucially there is no downsampling in spatial resolution but global context modeling at every layer of the encoder transformer.",
  "title": "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "TWEC",
  "full_name": "Temporal Word Embeddings with a Compass",
  "description": "TWEC is a method to generate temporal word embeddings: this method is efficient and it is based on a simple heuristic: we train an atemporal word embedding, the compass and we use this embedding to freeze one of the layers of the CBOW architecture. The frozen architecture is then used to train time-specific slices that are all comparable after training.",
  "title": "Training Temporal Word Embeddings with a Compass",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "CondInst",
  "full_name": "Conditional Convolutions for Instance Segmentation",
  "description": "CondInst is a simple yet effective instance segmentation framework. It eliminates ROI cropping and feature alignment with the instance-aware mask heads. As a result, CondInst can solve instance segmentation with fully convolutional networks. CondInst is able to produce high-resolution instance masks without longer computational time. Extensive experiments show that CondInst can achieve even better performance and inference speed than [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn). It can be a strong alternative to previous ROI-based instance segmentation methods. Code is at https://github.com/aim-uofa/AdelaiDet.",
  "title": "Conditional Convolutions for Instance Segmentation",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Disp R-CNN",
  "full_name": "Disp R-CNN",
  "description": "**Disp R-CNN** is a 3D object detection system for stereo images. It utilizes an instance disparity estimation network (iDispNet) that predicts disparity only for pixels on objects of interest and learns a category-specific shape prior for more accurate disparity estimation. To address the challenge from scarcity of disparity annotation in training, a statistical shape model is used to generate dense disparity pseudo-ground-truth without the need of LiDAR point clouds.",
  "title": "Disp R-CNN: Stereo 3D Object Detection via Shape Prior Guided Instance Disparity Estimation",
  "collection": "3D Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Meta Pseudo Labels",
  "full_name": "Meta Pseudo Labels",
  "description": "**Meta Pseudo Labels** is a semi-supervised learning method that uses a teacher network to generate pseudo labels on unlabeled data to teach a student network. The teacher receives feedback from the student to inform the teacher to generate better pseudo labels. This feedback signal is used as a reward to train the teacher throughout the course of the student’s learning.",
  "title": "Meta Pseudo Labels",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "Random Scaling",
  "full_name": "Random Scaling",
  "description": "**Random Scaling** is a type of image data augmentation where we randomly change the scale the image between a specified range.",
  "title": null,
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Auto-Classifier",
  "full_name": "Auto-Classifier",
  "description": "",
  "title": "Auto-Classifier: A Robust Defect Detector Based on an AutoML Head",
  "collection": "AutoML",
  "area": "General"
}
{
  "name": "PCKMeans",
  "full_name": "Pairwise Constrained KMeans",
  "description": "A variant of the popular k-means algorithm that integrates constraint satisfaction into its objective function.\r\n\r\nOriginal paper : Active Semi-Supervision for Pairwise Constrained Clustering, Basu et al. 2004",
  "title": null,
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "DDQL",
  "full_name": "Double Deep Q-Learning",
  "description": "",
  "title": "Deep Reinforcement Learning with Double Q-learning",
  "collection": "Reinforcement Learning Frameworks",
  "area": "Reinforcement Learning"
}
{
  "name": "ATMO",
  "full_name": "AdapTive Meta Optimizer",
  "description": "This method combines multiple optimization techniques like [ADAM](https://paperswithcode.com/method/adam) and [SGD](https://paperswithcode.com/method/sgd) or PADAM. This method can be applied to any couple of optimizers.\r\n\r\nImage credit: [Combining Optimization Methods Using an Adaptive Meta Optimizer](https://www.mdpi.com/1999-4893/14/6/186)",
  "title": "Combining Optimization Methods Using an Adaptive Meta Optimizer",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "KP",
  "full_name": "Kollen-Pollack Learning",
  "description": "",
  "title": "Deep Learning without Weight Transport",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Local Response Normalization",
  "full_name": "Local Response Normalization",
  "description": "**Local Response Normalization** is a normalization layer that implements the idea of lateral inhibition. Lateral inhibition is a concept in neurobiology that refers to the phenomenon of an excited neuron inhibiting its neighbours: this leads to a peak in the form of a local maximum, creating contrast in that area and increasing sensory perception. In practice, we can either normalize within the same channel or normalize across channels when we apply LRN to convolutional neural networks.\r\n\r\n$$ b_{c} = a_{c}\\left(k + \\frac{\\alpha}{n}\\sum_{c'=\\max(0, c-n/2)}^{\\min(N-1,c+n/2)}a_{c'}^2\\right)^{-\\beta} $$\r\n\r\nWhere the size is the number of neighbouring channels used for normalization, $\\alpha$ is multiplicative factor, $\\beta$ an exponent and $k$ an additive factor",
  "title": "ImageNet Classification with Deep Convolutional Neural Networks",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Inception-ResNet-v2-C",
  "full_name": "Inception-ResNet-v2-C",
  "description": "**Inception-ResNet-v2-C** is an image model block for an 8 x 8 grid used in the [Inception-ResNet-v2](https://paperswithcode.com/method/inception-resnet-v2) architecture. It largely follows the idea of Inception modules - and grouped convolutions - but also includes residual connections.",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Pointwise Convolution",
  "full_name": "Pointwise Convolution",
  "description": "**Pointwise Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) that uses a 1x1 kernel: a kernel that iterates through every single point. This kernel has a depth of however many channels the input image has. It can be used in conjunction with [depthwise convolutions](https://paperswithcode.com/method/depthwise-convolution) to produce an efficient class of convolutions known as [depthwise-separable convolutions](https://paperswithcode.com/method/depthwise-separable-convolution).\r\n\r\nImage Credit: [Chi-Feng Wang](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)",
  "title": null,
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "AutoML-Zero",
  "full_name": "AutoML-Zero",
  "description": "**AutoML-Zero** is an AutoML technique that aims to search a fine-grained space simultaneously for the model, optimization procedure, initialization, and so on, permitting much less human-design and even allowing the discovery of non-neural network algorithms. It represents ML algorithms as computer programs comprised of three component functions, Setup, Predict, and Learn, that performs initialization, prediction and learning. The instructions in these functions apply basic mathematical operations on a small memory. The operation and memory addresses used by each instruction are free parameters in the search space, as is the size of the component functions. While this reduces expert design, the consequent sparsity means that [random search](https://paperswithcode.com/method/random-search) cannot make enough progress. To overcome this difficulty, the authors use small proxy tasks and migration techniques to build an optimized infrastructure capable of searching through 10,000 models/second/cpu core.\r\n\r\nEvolutionary methods can find solutions in the AutoML-Zero search space despite its enormous\r\nsize and sparsity. The authors show that by randomly modifying the programs and periodically selecting the best performing ones on given tasks/datasets, AutoML-Zero discovers reasonable algorithms. They start from empty programs and using data labeled by “teacher” neural networks with random weights, and demonstrate  evolution can discover neural networks trained by gradient descent. Following this, they minimize bias toward known algorithms by switching to binary classification tasks extracted from CIFAR-10 and allowing a larger set of possible operations. This discovers interesting techniques like multiplicative interactions, normalized gradient and weight averaging. Finally, they show it is possible for evolution to adapt the algorithm to the type of task provided. 
For example, [dropout](https://paperswithcode.com/method/dropout)-like operations emerge when the task needs regularization and learning rate decay appears when the task requires faster convergence.",
  "title": "AutoML-Zero: Evolving Machine Learning Algorithms From Scratch",
  "collection": "AutoML",
  "area": "General"
}
{
  "name": "LSTM",
  "full_name": "Long Short-Term Memory",
  "description": "An **LSTM** is a type of [recurrent neural network](https://paperswithcode.com/methods/category/recurrent-neural-networks) that addresses the vanishing gradient problem in vanilla RNNs through additional cells, input and output gates. Intuitively, vanishing gradients are solved through additional *additive* components, and forget gate activations, that allow the gradients to flow through the network without vanishing as quickly.\r\n\r\n(Image Source [here](https://medium.com/datadriveninvestor/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577))\r\n\r\n(Introduced by Hochreiter and Schmidhuber)",
  "title": null,
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Symbolic rule learning",
  "full_name": "Symbolic rule learning",
  "description": "Symbolic rule learning methods find regularities in data that can be expressed in the form of 'if-then' rules based on symbolic representations of the data.",
  "title": "Scalable and interpretable rule-based link prediction for large heterogeneous knowledge graphs",
  "collection": "Rule-based systems",
  "area": "General"
}
{
  "name": "Macaw",
  "full_name": "Macaw",
  "description": "**Macaw** is a generative question-answering (QA) system that is built on UnifiedQA, itself built on [T5](https://paperswithcode.com/method/t5). Macaw has three interesting features. First, it often produces high-quality answers to questions far outside the domain it was trained on, sometimes surprisingly so. Second, Macaw allows different permutations (“an gles”) of inputs and outputs to be used. For example, we can give it a question and get an answer; or give it an answer and get a question; or give it a question and answer and get a set of multiple-choice (MC) options for that question. This multi-angle QA capability allows versatility in the way Macaw can be used, include recursively using outputs as new inputs to the system. Finally, Macaw also generates explanations as an optional output (or even input) element.",
  "title": "General-Purpose Question-Answering with Macaw",
  "collection": "Question Answering Models",
  "area": "Natural Language Processing"
}
{
  "name": "Lion",
  "full_name": "Evolved Sign Momentum",
  "description": "The Lion optimizer is discovered by symbolic program search. It is more memory-efficient than most adaptive optimizers as it only needs to momentum. The update of Lion is produced by the sign function.",
  "title": null,
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "YOHO",
  "full_name": "You Only Hypothesize Once",
  "description": "**You Only Hypothesize Once** is a local descriptor-based framework for the registration of two unaligned point clouds. The proposed descriptor achieves the rotation invariance by recent technologies of group equivariant feature learning, which brings more robustness to point density and noise. The descriptor in YOHO also has a rotation-equivariant part, which enables the estimation the registration from just one correspondence hypothesis.",
  "title": "You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors",
  "collection": "Point Cloud Models",
  "area": "Computer Vision"
}
{
  "name": "Effective Squeeze-and-Excitation Block",
  "full_name": "Effective Squeeze-and-Excitation Block",
  "description": "**Effective Squeeze-and-Excitation Block** is an image model block based on squeeze-and-excitation, the difference being that one less FC layer is used. The authors note the SE module has a limitation: channel information loss due to dimension reduction. For avoiding high model complexity burden, two FC layers of the SE module need to reduce channel dimension. Specifically, while the first FC layer reduces input feature channels $C$ to $C/r$ using reduction ratio $r$, the second FC layer expands the reduced channels to original channel size $C$. As a result, this channel dimension reduction causes channel information loss. Therefore, effective SE (eSE) uses only one FC layer with $C$ channels instead of two FCs without channel dimension reduction, which maintains channel information.",
  "title": "CenterMask : Real-Time Anchor-Free Instance Segmentation",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "ESP",
  "full_name": "Efficient Spatial Pyramid",
  "description": "An **Efficient Spatial Pyramid (ESP)** is an image model block based on a factorization principle that decomposes a standard [convolution](https://paperswithcode.com/method/convolution) into two steps: (1) point-wise convolutions and (2) spatial pyramid of dilated convolutions. The point-wise convolutions help in reducing the computation, while the spatial pyramid of dilated convolutions re-samples the feature maps to learn the representations from large effective receptive field. This allows for increased efficiency compared to another image blocks like [ResNeXt](https://paperswithcode.com/method/resnext) blocks and Inception modules.",
  "title": "ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "WideResNet",
  "full_name": "WideResNet",
  "description": "**Wide Residual Networks** are a variant on [ResNets](https://paperswithcode.com/method/resnet) where we decrease depth and increase the width of residual networks. This is achieved through the use of wide residual blocks.",
  "title": "Wide Residual Networks",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "LEAF",
  "full_name": "Learnable Extended Activation Function",
  "description": "",
  "title": "Learnable Extended Activation Function (LEAF) for Deep Neural Networks",
  "collection": "Adaptive Activation Functions",
  "area": "General"
}
{
  "name": "AdaptiveBins",
  "full_name": "Adaptive Bins",
  "description": "",
  "title": "AdaBins: Depth Estimation using Adaptive Bins",
  "collection": "Adaptive Computation",
  "area": "General"
}
{
  "name": "ClusterFit",
  "full_name": "ClusterFit",
  "description": "**ClusterFit** is a self-supervision approach for learning image representations.  Given a dataset, we (a) cluster its features extracted from a pre-trained network using k-means and (b) re-train a new network from scratch on this dataset using cluster assignments as pseudo-labels.",
  "title": "ClusterFit: Improving Generalization of Visual Representations",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Hierarchical-Split Block",
  "full_name": "Hierarchical-Split Block",
  "description": "**Hierarchical-Split Block** is a representational block for multi-scale feature representations. It contains many hierarchical split and concatenate connections within one single [residual block](https://paperswithcode.com/methods/category/skip-connection-blocks). \r\n\r\nSpecifically, ordinary feature maps in deep neural networks are split into $s$ groups, each with $w$ channels. As shown in the Figure, only the first group of filters can be straightly connected to next layer. The second group of feature maps are sent to a convolution of $3 \\times 3$ filters to extract features firstly, then the output feature maps are split into two sub-groups in the channel dimension. One sub-group of feature maps straightly connected to next layer, while the other sub-group is concatenated with the next group of input feature maps in the channel dimension. The concatenated feature maps are operated by a set of $3 \\times 3$ convolutional filters. This process repeats several times until the rest of input feature maps are processed. Finally, features maps from all input groups are concatenated and sent to another layer of $1 \\times 1$ filters to rebuild the features.",
  "title": "HS-ResNet: Hierarchical-Split Block on Convolutional Neural Network",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "ProxyOptimization",
  "full_name": "Proxy Optimization for initial Proxies in Proxy Anchor Loss",
  "description": "",
  "title": "VeriMedi: Pill Identification using Proxy-based Deep Metric Learning and Exact Solution",
  "collection": "Initialization",
  "area": "General"
}
{
  "name": "AdvProp",
  "full_name": "AdvProp",
  "description": "**AdvProp** is an adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. Key to the method is the usage of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions to normal examples.",
  "title": "Adversarial Examples Improve Image Recognition",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "ASLFeat",
  "full_name": "ASLFeat",
  "description": "**ASLFeat** is a convolutional neural network for learning local features that uses deformable convolutional networks to densely estimate and apply local transformation. It also takes advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, it uses a peakiness measurement to relate feature responses and derive more indicative detection scores.",
  "title": "ASLFeat: Learning Local Features of Accurate Shape and Localization",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "VGG-16",
  "full_name": "VGG-16",
  "description": "",
  "title": "Very Deep Convolutional Networks for Large-Scale Image Recognition",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "K-Maximal Word Allocation",
  "full_name": "K-Maximal Word Allocation",
  "description": "",
  "title": "DiMSum: Distributed and Multilingual Summarization of Financial Narratives",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "Random Grayscale",
  "full_name": "Random Grayscale",
  "description": "**Random Grayscale**  is an image data augmentation that converts an image to grayscale with probability $p$.",
  "title": null,
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "T-D",
  "full_name": "Transformer Decoder",
  "description": "[Transformer](https://paperswithcode.com/method/transformer)-Decoder is a modification to Transformer-Encoder-Decoder for long sequences that drops the encoder\r\nmodule, combines the input and output sequences into a single ”sentence” and is trained as a standard language model. It is used in [GPT](https://paperswithcode.com/method/gpt) and later revisions.",
  "title": "Generating Wikipedia by Summarizing Long Sequences",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "PCT",
  "full_name": "Perceptual control theoretic architecture",
  "description": "",
  "title": "PCT and Beyond: Towards a Computational Framework for `Intelligent' Communicative Systems",
  "collection": "Machine Translation Models",
  "area": "Natural Language Processing"
}
{
  "name": "Dilated Convolution",
  "full_name": "Dilated Convolution",
  "description": "**Dilated Convolutions** are a type of [convolution](https://paperswithcode.com/method/convolution) that “inflate” the kernel by inserting holes between the kernel elements. An additional parameter $l$ (dilation rate) indicates how much the kernel is widened. There are usually $l-1$ spaces inserted between kernel elements. \r\n\r\nNote that concept has existed in past literature under different names, for instance the *algorithme a trous*,  an algorithm for wavelet decomposition (Holschneider et al., 1987; Shensa, 1992).",
  "title": "Multi-Scale Context Aggregation by Dilated Convolutions",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "HalluciNet",
  "full_name": "Approximating Spatiotemporal Representations Using a 2DCNN",
  "description": "Approximating Spatiotemporal Representations Using a 2DCNN",
  "title": "HalluciNet-ing Spatiotemporal Representations Using a 2D-CNN",
  "collection": "Action Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "TransE",
  "full_name": "TransE",
  "description": "**TransE** is an energy-based model that produces knowledge base embeddings. It models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Relationships are represented as translations in the embedding space: if $\\left(h, \\mathcal{l}, t\\right)$ holds, the embedding of the tail entity $t$ should be close to the embedding of the head entity $h$ plus some vector that depends on the relationship $\\mathcal{l}$.",
  "title": "Translating Embeddings for Modeling Multi-relational Data",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "CIDA",
  "full_name": "Continuously Indexed Domain Adaptation",
  "description": "**Continuously Indexed Domain Adaptation** combines traditional adversarial adaptation with a novel discriminator that models the encoding-conditioned domain index distribution.\r\n\r\nImage Source: [Wang et al.](https://arxiv.org/pdf/2007.01807v2.pdf)",
  "title": "Continuously Indexed Domain Adaptation",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "BLOOMZ",
  "full_name": "BLOOMZ",
  "description": "**BLOOMZ** is a Multitask prompted finetuning (MTF) variant of BLOOM.",
  "title": "Crosslingual Generalization through Multitask Finetuning",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "IQL",
  "full_name": "Implicit Q-Learning",
  "description": "",
  "title": "Offline Reinforcement Learning with Implicit Q-Learning",
  "collection": "Offline Reinforcement Learning Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "TGN",
  "full_name": "Temporal Graph Network",
  "description": "**Temporal Graph Network**, or **TGN**, is a framework for deep learning on dynamic graphs represented as sequences of timed events. The memory (state) of the model at time $t$ consists of a vector $\\mathbf{s}_i(t)$ for each node $i$ the model has seen so far. The memory of a node is updated after an event (e.g. interaction with another node or node-wise change), and its purpose is to represent the node's history in a compressed format. Thanks to this specific module, TGNs have the capability to memorize long term dependencies for each node in the graph. When a new node is encountered, its memory is initialized as the zero vector, and it is then updated for each event involving the node, even after the model has finished training.",
  "title": "Temporal Graph Networks for Deep Learning on Dynamic Graphs",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "WaveGlow",
  "full_name": "WaveGlow",
  "description": "**WaveGlow** is a flow-based generative model that generates audio by sampling from a distribution. Specifically samples are taken from a zero mean spherical Gaussian with the same number of dimensions as our desired output, and those samples are put through a series of layers that transforms the simple distribution to one which has the desired distribution.",
  "title": "WaveGlow: A Flow-based Generative Network for Speech Synthesis",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "SeqINT",
  "full_name": "Sequential Information Threading",
  "description": "Unsupervised machine learning approach for identifying information threads by leveraging answers to 5W1H questions from documents, the temporal relationships between documents and hierarchical agglomerative clustering (HAC).",
  "title": "Identifying chronological and coherent information threads using 5W1H questions and temporal relationships",
  "collection": null,
  "area": null
}
{
  "name": "Mixup",
  "full_name": "Mixup",
  "description": "**Mixup** is a data augmentation technique that generates a weighted combination of random image pairs from the training data. Given two images and their ground truth labels: $\\left(x\\_{i}, y\\_{i}\\right), \\left(x\\_{j}, y\\_{j}\\right)$, a synthetic training example $\\left(\\hat{x}, \\hat{y}\\right)$ is generated as:\r\n\r\n$$ \\hat{x} = \\lambda{x\\_{i}} + \\left(1 − \\lambda\\right){x\\_{j}} $$\r\n$$ \\hat{y} = \\lambda{y\\_{i}} + \\left(1 − \\lambda\\right){y\\_{j}} $$\r\n\r\nwhere $\\lambda \\sim \\text{Beta}\\left(\\alpha = 0.2\\right)$ is independently sampled for each augmented example.",
  "title": "mixup: Beyond Empirical Risk Minimization",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "DenseNet-Elastic",
  "full_name": "DenseNet-Elastic",
  "description": "**DenseNet-Elastic** is a convolutional neural network that is a modification of a [DenseNet](https://paperswithcode.com/method/densenet) with elastic blocks (extra upsampling and downsampling).",
  "title": "ELASTIC: Improving CNNs with Dynamic Scaling Policies",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "IoU-Net",
  "full_name": "IoU-Net",
  "description": "**IoU-Net** is an object detection architecture that introduces localization confidence. IoU-Net learns to predict the IoU between each detected bounding box and the matched ground-truth. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective.",
  "title": "Acquisition of Localization Confidence for Accurate Object Detection",
  "collection": "Localization Models",
  "area": "Computer Vision"
}
{
  "name": "MoGA-A",
  "full_name": "MoGA-A",
  "description": "**MoGA-A** is a convolutional neural network optimized for mobile latency and discovered via Mobile GPU-Aware (MoGA) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). The basic building block is MBConvs (inverted residual blocks) from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2). Squeeze-and-excitation layers are also experimented with.",
  "title": "MoGA: Searching Beyond MobileNetV3",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Directional Sparse Filtering",
  "full_name": "Directional Sparse FIltering",
  "description": "",
  "title": "Learning complex-valued latent filters with absolute cosine similarity",
  "collection": "Speech Separation Models",
  "area": "Audio"
}
{
  "name": "GFP-GAN",
  "full_name": "GFP-GAN",
  "description": "**GFP-GAN** is a generative adversarial network for blind face restoration that leverages a generative facial prior (GFP). This Generative Facial Prior (GFP) is incorporated into the face restoration process via channel-split spatial feature transform layers, which allow for a good balance between realness and fidelity. As a whole, the GFP-GAN consists of a degradation removal module ([U-Net](https://paperswithcode.com/method/u-net)) and a pretrained face  [StyleGAN](https://paperswithcode.com/method/stylegan) as a facial prior. They are bridged by a latent code mapping and several Channel-Split [Spatial Feature Transform](https://paperswithcode.com/method/spatial-feature-transform) (CS-SFT) layers. During training, 1) intermediate restoration losses are employed to remove complex degradation, 2) Facial component loss with discriminators is used to enhance facial details, and 3) identity preserving loss is used to retain face identity.",
  "title": "Towards Real-World Blind Face Restoration with Generative Facial Prior",
  "collection": "Face Restoration Models",
  "area": "Computer Vision"
}
{
  "name": "Adafactor",
  "full_name": "Adafactor",
  "description": "**Adafactor** is a stochastic optimization method based on [Adam](https://paperswithcode.com/method/adam) that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved through maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \\times m$ matrix, this reduces the memory requirements from $O(n m)$ to $O(n + m)$. \r\n\r\nInstead of defining the optimization algorithm in terms of absolute step sizes {$\\alpha_t$}$\\_{t=1}^T$, the authors define the optimization algorithm in terms of relative step sizes {$\\rho_t$}$\\_{t=1}^T$, which get multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\\epsilon_2$.  The reason for this lower bound is to allow zero-initialized parameters to escape 0. \r\n\r\nProposed hyperparameters are: $\\epsilon\\_{1} = 10^{-30}$, $\\epsilon\\_{2} = 10^{-3}$, $d=1$, $p\\_{t} = \\min\\left(10^{-2}, \\frac{1}{\\sqrt{t}}\\right)$, $\\hat{\\beta}\\_{2\\_{t}} = 1 - t^{-0.8}$.",
  "title": "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "AWARE",
  "full_name": "Attentive Walk-Aggregating Graph Neural Network",
  "description": "We propose to theoretically and empirically examine the effect of incorporating weighting schemes into walk-aggregating GNNs. To this end, we propose a simple, interpretable, and end-to-end supervised GNN model, called AWARE (Attentive Walk-Aggregating GRaph Neural NEtwork), for graph-level prediction. AWARE aggregates the walk information by means of weighting schemes at distinct levels (vertex-, walk-, and graph-level) in a principled manner. By virtue of the incorporated weighting schemes at these different levels, AWARE can emphasize the information important for prediction while diminishing the irrelevant ones—leading to representations that can improve learning performance.",
  "title": null,
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "Pipelined Backpropagation",
  "full_name": "Pipelined Backpropagation",
  "description": "**Pipelined Backpropagation** is an asynchronous pipeline parallel training algorithm. It was first introduced by Petrowski et al (1993). It avoids fill and drain overhead by updating the weights without draining the pipeline first. This results in weight inconsistency, the use of different weights on the forward and backward passes for a given micro-batch. The weights used to produce a particular gradient may also have been updated when the gradient is applied, resulting in stale (or delayed) gradients. For these reasons PB resembles Asynchronous [SGD](https://paperswithcode.com/method/sgd) and is not equivalent to standard SGD. Finegrained pipelining increases the number of pipeline stages and hence increases the weight inconsistency and delay.",
  "title": "Pipelined Backpropagation at Scale: Training Large Models without Batches",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "InfoGAN",
  "full_name": "InfoGAN",
  "description": "**InfoGAN** is a type of generative adversarial network that modifies the [GAN](https://paperswithcode.com/method/gan) objective to\r\nencourage it to learn interpretable and meaningful representations. This is done by maximizing the\r\nmutual information between a fixed small subset of the GAN’s noise variables and the observations.\r\n\r\nFormally, InfoGAN is defined as a minimax game with a variational regularization of mutual information and the hyperparameter $\\lambda$:\r\n\r\n$$ \\min\\_{G, Q}\\max\\_{D}V\\_{INFOGAN}\\left(D, G, Q\\right) = V\\left(D, G\\right) - \\lambda{L}\\_{I}\\left(G, Q\\right) $$\r\n\r\nWhere $Q$ is an auxiliary distribution that approximates the posterior $P\\left(c\\mid{x}\\right)$ - the probability of the latent code $c$ given the data $x$ - and $L\\_{I}$ is the variational lower bound of the mutual information between the latent code and the observations.\r\n\r\nIn the practical implementation, there is another fully-connected layer to output parameters for the conditional distribution $Q$ (negligible computation ontop of regular GAN structures). Q is represented with a [softmax](https://paperswithcode.com/method/softmax) non-linearity for a categorical latent code. For a continuous latent code, the authors assume a factored Gaussian.",
  "title": "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Conffusion",
  "full_name": "Confidence Intervals for Diffusion Models",
  "description": "Given a corrupted input image, Con\\textit{ffusion}, repurposes a pretrained diffusion model to generate lower and upper bounds around each reconstructed pixel. The true pixel value is guaranteed to fall within these bounds with probability $p$.",
  "title": "Conffusion: Confidence Intervals for Diffusion Models",
  "collection": "Image Restoration Models",
  "area": "Computer Vision"
}
{
  "name": "Rational Activation function",
  "full_name": "Rational Activation function",
  "description": "",
  "title": "Padé Activation Units: End-to-end Learning of Flexible Activation Functions in Deep Networks",
  "collection": "Adaptive Activation Functions",
  "area": "General"
}
{
  "name": "Temporal Distribution Matching",
  "full_name": "Temporal Distribution Matching",
  "description": "**Temporal Distribution Matching**, or **TDM**,  is a module used in the [AdaRNN](https://paperswithcode.com/method/adarnn) architecture to match the distributions of the discovered periods to build a time series prediction model $\\mathcal{M}$ Given the learned time periods, the TDM module is designed to learn the common knowledge shared by different periods via matching their distributions. Thus, the learned model $\\mathcal{M}$ is expected to generalize well on unseen test data compared with the methods which only rely on local or statistical information.\r\n\r\nWithin the context of AdaRNN, Temporal Distribution Matching aims to adaptively match the distributions between the [RNN](https://paperswithcode.com/methods/category/recurrent-neural-networks) cells of two periods while capturing the temporal dependencies. TDM introduces the importance vector $\\mathbf{\\alpha} \\in \\mathbb{R}^{\\hat{V}}$ to learn the relative importance of $V$ hidden states inside the RNN, where all the hidden states are weighted with a normalized $\\alpha$. Note that for each pair of periods, there is an $\\mathbf{\\alpha}$, and we omit the subscript if there is no confusion. In this way, we can dynamically reduce the distribution divergence of cross-periods.\r\n\r\nGiven a period-pair $\\left(\\mathcal{D}\\_{i}, \\mathcal{D}\\_{j}\\right)$, the loss of temporal distribution matching is formulated as:\r\n\r\n$$\r\n\\mathcal{L}\\_{t d m}\\left(\\mathcal{D}\\_{i}, \\mathcal{D}\\_{j} ; \\theta\\right)=\\sum_{t=1}^{V} \\alpha\\_{i, j}^{t} d\\left(\\mathbf{h}\\_{i}^{t}, \\mathbf{h}\\_{j}^{t} ; \\theta\\right)\r\n$$\r\n\r\nwhere $\\alpha\\_{i, j}^{t}$ denotes the distribution importance between the periods $\\mathcal{D}\\_{i}$ and $\\mathcal{D}\\_{j}$ at state $t$.\r\n\r\nAll the hidden states of the RNN can be easily computed by following the standard RNN computation. Denote by $\\delta(\\cdot)$ the computation of a next hidden state based on a previous state. 
The state computation can be formulated as\r\n\r\n$$\r\n\\mathbf{h}\\_{i}^{t}=\\delta\\left(\\mathbf{x}\\_{i}^{t}, \\mathbf{h}\\_{i}^{t-1}\\right)\r\n$$\r\n\r\nThe final objective of temporal distribution matching (one RNN layer) is:\r\n\r\n$$\r\n\\mathcal{L}(\\theta, \\mathbf{\\alpha})=\\mathcal{L}\\_{\\text {pred }}(\\theta)+\\lambda \\frac{2}{K(K-1)} \\sum\\_{i, j}^{i \\neq j} \\mathcal{L}\\_{t d m}\\left(\\mathcal{D}\\_{i}, \\mathcal{D}\\_{j} ; \\theta, \\mathbf{\\alpha}\\right)\r\n$$\r\n\r\nwhere $\\lambda$ is a trade-off hyper-parameter. Note that in the second term, we compute the average of the distribution distances of all pairwise periods. For computation, we take a mini-batch of $\\mathcal{D}_{i}$ and $\\mathcal{D}\\_{j}$ to perform forward operation in RNN layers and concatenate all hidden features. Then, we can perform TDM using the above equation.",
  "title": "AdaRNN: Adaptive Learning and Forecasting of Time Series",
  "collection": "Time Series Modules",
  "area": "Sequential"
}
{
  "name": "Transformer",
  "full_name": "Transformer",
  "description": "A **Transformer** is a model architecture that eschews recurrence and instead relies entirely on an [attention mechanism](https://paperswithcode.com/methods/category/attention-mechanisms-1) to draw global dependencies between input and output. Before Transformers, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The Transformer also employs an encoder and decoder, but removing recurrence in favor of [attention mechanisms](https://paperswithcode.com/methods/category/attention-mechanisms-1) allows for significantly more parallelization than methods like [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks) and [CNNs](https://paperswithcode.com/methods/category/convolutional-neural-networks).",
  "title": "Attention Is All You Need",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "InfoNCE",
  "full_name": "InfoNCE",
  "description": "**InfoNCE**, where NCE stands for Noise-Contrastive Estimation, is a type of contrastive loss function used for [self-supervised learning](https://paperswithcode.com/methods/category/self-supervised-learning).\r\n\r\nGiven a set $X = ${$x\\_{1}, \\dots, x\\_{N}$} of $N$ random samples containing one positive sample from $p\\left(x\\_{t+k}|c\\_{t}\\right)$ and $N − 1$ negative samples from the 'proposal' distribution $p\\left(x\\_{t+k}\\right)$, we optimize:\r\n\r\n$$ \\mathcal{L}\\_{N} = - \\mathbb{E}\\_{X}\\left[\\log\\frac{f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right)}{\\sum\\_{x\\_{j}\\in{X}}f\\_{k}\\left(x\\_{j}, c\\_{t}\\right)}\\right] $$\r\n\r\nOptimizing this loss will result in $f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right)$ estimating the density ratio, which is:\r\n\r\n$$ f\\_{k}\\left(x\\_{t+k}, c\\_{t}\\right) \\propto \\frac{p\\left(x\\_{t+k}|c\\_{t}\\right)}{p\\left(x\\_{t+k}\\right)} $$",
  "title": "Representation Learning with Contrastive Predictive Coding",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Stochastic Depth",
  "full_name": "Stochastic Depth",
  "description": "**Stochastic Depth** aims to shrink the depth of a network during training, while\r\nkeeping it unchanged during testing. This is achieved by randomly dropping entire [ResBlocks](https://paperswithcode.com/method/residual-block) during training and bypassing their transformations through skip connections. \r\n\r\nLet $b\\_{l} \\in$ {$0, 1$} denote a Bernoulli random variable, which indicates whether the $l$th ResBlock is active ($b\\_{l} = 1$) or inactive ($b\\_{l} = 0$). Further, let us denote the “survival” probability of ResBlock $l$ as $p\\_{l} = \\text{Pr}\\left(b\\_{l} = 1\\right)$. With this definition we can bypass the $l$th ResBlock by multiplying its function $f\\_{l}$ with $b\\_{l}$ and we extend the update rule to:\r\n\r\n$$ H\\_{l} = \\text{ReLU}\\left(b\\_{l}f\\_{l}\\left(H\\_{l-1}\\right) + \\text{id}\\left(H\\_{l-1}\\right)\\right) $$\r\n\r\nIf $b\\_{l} = 1$, this reduces to the original [ResNet](https://paperswithcode.com/method/resnet) update and this ResBlock remains unchanged. If $b\\_{l} = 0$, the ResBlock reduces to the identity function, $H\\_{l} = \\text{id}\\left((H\\_{l}−1\\right)$.",
  "title": "Deep Networks with Stochastic Depth",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "MixText",
  "full_name": "MixText",
  "description": "**MixText** is a semi-supervised learning method for text classification, which uses a new data augmentation method called TMix. TMix creates a large amount of augmented training samples by interpolating text in hidden space. The technique leverages advances in data augmentation to guess low-entropy labels for unlabeled data, making them as easy to use as labeled data.",
  "title": "MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "RetinaNet-RS",
  "full_name": "RetinaNet-RS",
  "description": "**RetinaNet-RS** is an object detection model produced through a model scaling method based on changing the the input resolution and [ResNet](https://paperswithcode.com/method/resnet) backbone depth. For [RetinaNet](https://paperswithcode.com/method/retinanet), we scale up input resolution from 512 to 768 and the ResNet backbone depth from 50 to 152. As RetinaNet performs dense one-stage object detection, the authors find scaling up input resolution leads to large resolution feature maps hence more anchors to process. This results in a higher capacity dense prediction heads and expensive NMS. Scaling stops at input resolution 768 × 768 for RetinaNet.",
  "title": "Simple Training Strategies and Model Scaling for Object Detection",
  "collection": "One-Stage Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "ABC",
  "full_name": "Approximate Bayesian Computation",
  "description": "Class of methods in Bayesian Statistics where the posterior distribution is approximated over a rejection scheme on simulations because the likelihood function is intractable.\r\n\r\nDifferent parameters get sampled and simulated. Then a distance function is calculated to measure the quality of the simulation compared to data from real observations. Only simulations that fall below a certain threshold get accepted.\r\n\r\nImage source: [Kulkarni et al.](https://www.umass.edu/nanofabrics/sites/default/files/PDF_0.pdf)",
  "title": "Accelerating Simulation-based Inference with Emerging AI Hardware",
  "collection": "Approximate Inference",
  "area": "General"
}
{
  "name": "FcaNet",
  "full_name": "Frequency channel attention networks",
  "description": "FCANet contains a novel multi-spectral channel attention module. Given an input feature map $X \\in \\mathbb{R}^{C \\times H \\times W}$, multi-spectral channel attention first splits $X$ into many parts $x^{i} \\in \\mathbb{R}^{C' \\times H \\times W}$. Then it applies a 2D DCT to each part $x^{i}$. Note that a 2D DCT can use pre-processing results to reduce computation. After processing each part,  all results are concatenated into a vector. Finally, fully connected layers, ReLU activation and a sigmoid are used to get the attention vector as in an SE block. This can be formulated as:\r\n\\begin{align}\r\n    s = F_\\text{fca}(X, \\theta) & = \\sigma (W_{2} \\delta (W_{1}[(\\text{DCT}(\\text{Group}(X)))]))\r\n\\end{align}\r\n\\begin{align}\r\n    Y & = s  X\r\n\\end{align}\r\nwhere $\\text{Group}(\\cdot)$ indicates dividing the input into many groups and $\\text{DCT}(\\cdot)$ is the 2D discrete cosine transform. \r\n\r\nThis work based on information compression and discrete cosine transforms achieves excellent performance on the classification task.",
  "title": "FcaNet: Frequency Channel Attention Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "CCNet",
  "full_name": "Criss-Cross Network",
  "description": "**Criss-Cross Network** (**CCNet**) aims to obtain full-image contextual information in an effective and efficient way. Concretely,\r\nfor each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. **CCNet** is with the following\r\nmerits: **1)** GPU memory friendly. Compared with the [non-local block](https://paperswithcode.com/method/non-local-block), the proposed recurrent criss-cross attention module requires 11× less GPU memory usage. **2)** High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85% of the non-local block. **3)** The state-of-the-art performance.",
  "title": "CCNet: Criss-Cross Attention for Semantic Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "KNN and IOU based verification",
  "full_name": "KNN and IOU based verification",
  "description": "**KNN and IoU-based Verification** is used to verify detections and choose between multiple detections of the same underlying object. It was originally used within the context of blood cell counting in medical images. To avoid this double counting problem, the KNN algorithm is applied in each platelet to determine its closest platelet and then using the intersection of union (IOU) between two platelets we calculate their extent of overlap. The authors allow 10% of the overlap between platelet and its closest platelet based on empirical observations. If the overlap is larger than that, they ignore that cell as a double count to get rid of spurious counting.",
  "title": "Machine learning approach of automatic identification and counting of blood cells",
  "collection": "Counting Methods",
  "area": "Computer Vision"
}
{
  "name": "AmoebaNet",
  "full_name": "AmoebaNet",
  "description": "**AmoebaNet** is a convolutional neural network found through regularized evolution architecture search. The search space is NASNet, which specifies a space of image classifiers with a fixed outer structure: a feed-forward stack of [Inception-like modules](https://paperswithcode.com/method/inception-module) called cells. The discovered architecture is shown to the right.",
  "title": "Regularized Evolution for Image Classifier Architecture Search",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "A3C",
  "full_name": "A3C",
  "description": "**A3C**, **Asynchronous Advantage Actor Critic**, is a policy gradient algorithm in reinforcement learning that maintains a policy $\\pi\\left(a\\_{t}\\mid{s}\\_{t}; \\theta\\right)$ and an estimate of the value\r\nfunction $V\\left(s\\_{t}; \\theta\\_{v}\\right)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value-function. The policy and the value function are updated after every $t\\_{\\text{max}}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\\nabla\\_{\\theta{'}}\\log\\pi\\left(a\\_{t}\\mid{s\\_{t}}; \\theta{'}\\right)A\\left(s\\_{t}, a\\_{t}; \\theta, \\theta\\_{v}\\right)$ where $A\\left(s\\_{t}, a\\_{t}; \\theta, \\theta\\_{v}\\right)$ is an estimate of the advantage function given by:\r\n\r\n$$\\sum^{k-1}\\_{i=0}\\gamma^{i}r\\_{t+i} + \\gamma^{k}V\\left(s\\_{t+k}; \\theta\\_{v}\\right) - V\\left(s\\_{t}; \\theta\\_{v}\\right)$$\r\n\r\nwhere $k$ can vary from state to state and is upper-bounded by $t\\_{max}$.\r\n\r\nThe critics in A3C learn the value function while multiple actors are trained in parallel and get synced with global parameters every so often. The gradients are accumulated as part of training for stability - this is like parallelized stochastic gradient descent.\r\n\r\nNote that while the parameters $\\theta$ of the policy and $\\theta\\_{v}$ of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one [softmax](https://paperswithcode.com/method/softmax) output for the policy $\\pi\\left(a\\_{t}\\mid{s}\\_{t}; \\theta\\right)$ and one linear output for the value function $V\\left(s\\_{t}; \\theta\\_{v}\\right)$, with all non-output layers shared.",
  "title": "Asynchronous Methods for Deep Reinforcement Learning",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "PREDATOR",
  "full_name": "PREDATOR",
  "description": "**PREDATOR** is a model for pairwise point-cloud registration with deep attention to the overlap region. Its key novelty is an overlap-attention block for early information exchange between the latent encodings of the two point clouds. In this way the subsequent decoding of the latent representations into per-point features is conditioned on the respective other point cloud, and thus can predict which points are not only salient, but also lie in the overlap region between the two point clouds.",
  "title": "PREDATOR: Registration of 3D Point Clouds with Low Overlap",
  "collection": "Point Cloud Models",
  "area": "Computer Vision"
}
{
  "name": "TRPO",
  "full_name": "Trust Region Policy Optimization",
  "description": "**Trust Region Policy Optimization**, or **TRPO**, is a policy gradient method in reinforcement learning that avoids parameter updates that change the policy too much with a KL divergence constraint on the size of the policy update at each iteration.\r\n\r\nTake the case of off-policy reinforcement learning, where the policy $\\beta$ for collecting trajectories on rollout workers is different from the policy $\\pi$ to optimize for. The objective function in an off-policy model measures the total advantage over the state visitation distribution and actions, while the mismatch between the training data distribution and the true policy state distribution is compensated with an importance sampling estimator:\r\n\r\n$$ J\\left(\\theta\\right) = \\sum\\_{s\\in{S}}p^{\\pi\\_{\\theta\\_{old}}}\\sum\\_{a\\in\\mathcal{A}}\\left(\\pi\\_{\\theta}\\left(a\\mid{s}\\right)\\hat{A}\\_{\\theta\\_{old}}\\left(s, a\\right)\\right) $$\r\n\r\n$$ J\\left(\\theta\\right) = \\sum\\_{s\\in{S}}p^{\\pi\\_{\\theta\\_{old}}}\\sum\\_{a\\in\\mathcal{A}}\\left(\\beta\\left(a\\mid{s}\\right)\\frac{\\pi\\_{\\theta}\\left(a\\mid{s}\\right)}{\\beta\\left(a\\mid{s}\\right)}\\hat{A}\\_{\\theta\\_{old}}\\left(s, a\\right)\\right) $$\r\n\r\n$$ J\\left(\\theta\\right) = \\mathbb{E}\\_{s\\sim{p}^{\\pi\\_{\\theta\\_{old}}}, a\\sim{\\beta}} \\left(\\frac{\\pi\\_{\\theta}\\left(a\\mid{s}\\right)}{\\beta\\left(a\\mid{s}\\right)}\\hat{A}\\_{\\theta\\_{old}}\\left(s, a\\right)\\right)$$\r\n\r\nWhen training on policy, theoretically the policy for collecting data is same as the policy that we want to optimize. However, when rollout workers and optimizers are running in parallel asynchronously, the behavior policy can get stale. 
TRPO considers this subtle difference: It labels the behavior policy as $\\pi\\_{\\theta\\_{old}}\\left(a\\mid{s}\\right)$ and thus the objective function becomes:\r\n\r\n$$ J\\left(\\theta\\right) = \\mathbb{E}\\_{s\\sim{p}^{\\pi\\_{\\theta\\_{old}}}, a\\sim{\\pi\\_{\\theta\\_{old}}}} \\left(\\frac{\\pi\\_{\\theta}\\left(a\\mid{s}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\mid{s}\\right)}\\hat{A}\\_{\\theta\\_{old}}\\left(s, a\\right)\\right)$$\r\n\r\nTRPO aims to maximize the objective function $J\\left(\\theta\\right)$ subject to a trust region constraint which enforces the distance between old and new policies measured by KL-divergence to be small enough, within a parameter $\\delta$:\r\n\r\n$$ \\mathbb{E}\\_{s\\sim{p}^{\\pi\\_{\\theta\\_{old}}}} \\left[D\\_{KL}\\left(\\pi\\_{\\theta\\_{old}}\\left(.\\mid{s}\\right)\\mid\\mid\\pi\\_{\\theta}\\left(.\\mid{s}\\right)\\right)\\right] \\leq \\delta$$",
  "title": "Trust Region Policy Optimization",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "RetinaMask",
  "full_name": "RetinaMask",
  "description": "**RetinaMask** is a one-stage object detection method that improves upon [RetinaNet](https://paperswithcode.com/method/retinanet) by adding the task of instance mask prediction during training, as well as an [adaptive loss](https://paperswithcode.com/method/adaptive-loss) that improves robustness to parameter choice during training, and including more difficult examples in training.",
  "title": "RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "SGPCS",
  "full_name": "Self-training Guided Prototypical Cross-domain Self-supervised learning",
  "description": "Model to adapt:\r\nWe use Ultra Fast Structure-aware Deep Lane Detection (UFLD) as baseline and strictly adopt its training scheme and hyperparameters. UFLD treats lane detection as a row-based classification problem and utilizes the row anchors defined by TuSimple.\r\n\r\nUnsupervised Domain Adaptation with SGPCS:\r\nSGPCS builds upon PCS  and performs in-domain contrastive learning and crossdomain self-supervised learning via cluster prototypes. \r\n\r\nWe reformulate the pseudo label selection mechanism from SGADA. For each lane, we select the highest confidence value from the griding cells of each row anchor. Based on their griding cell position, the confidence values are divided into two cases: absent lane points and present lane points. Thereby, the last griding cell represents absent lane points as in. For each case, we calculate the mean confidence over the corresponding lanes. We then use the thresholds defined by SGADA to decide whether the prediction is treated as a pseudo label.\r\n\r\nOur overall objective function comprises the in-domain and cross-domain loss from PCS, the losses defined by UFLD, and our adopted pseudo loss from SGADA. We adjust the momentum for memory bank feature updates to 0.5 and use spherical K-means with K = 2,500 to cluster them into prototypes.",
  "title": "CARLANE: A Lane Detection Benchmark for Unsupervised Domain Adaptation from Simulation to multiple Real-World Domains",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "ECA-Net",
  "full_name": "ECA-Net",
  "description": "An **ECA-Net** is a type of convolutional neural network that utilises an [Efficient Channel Attention](https://paperswithcode.com/method/efficient-channel-attention) module.",
  "title": "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "LapEigen",
  "full_name": "Laplacian EigenMap",
  "description": "",
  "title": "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "E2EAdaptiveDistTraining",
  "full_name": "End-to-end Adaptive Distributed Training",
  "description": "Distributed training has become a pervasive and effective approach for training a large neural network\r\n(NN) model with processing massive data. However, it is very challenging to satisfy requirements\r\nfrom various NN models, diverse computing resources, and their dynamic changes during a training\r\njob. In this study, we design our distributed training framework in a systematic end-to-end view to\r\nprovide the built-in adaptive ability for different scenarios, especially for industrial applications and\r\nproduction environments, by fully considering resource allocation, model partition, task placement,\r\nand distributed execution. Based on the unified distributed graph and the unified cluster object,\r\nour adaptive framework is equipped with a global cost model and a global planner, which can\r\nenable arbitrary parallelism, resource-aware placement, multi-mode execution, fault-tolerant, and\r\nelastic distributed training. The experiments demonstrate that our framework can satisfy various\r\nrequirements from the diversity of applications and the heterogeneity of resources with highly\r\ncompetitive performance.",
  "title": "End-to-end Adaptive Distributed Training on PaddlePaddle",
  "collection": "2D Parallel Distributed Methods",
  "area": "General"
}
{
  "name": "gSDE",
  "full_name": "Generalized State-Dependent Exploration",
  "description": "**Generalized State-Dependent Exploration**, or **gSDE**, is an exploration method for reinforcement learning that uses more general features and re-sampling the noise periodically. \r\n\r\nState-Dependent Exploration (SDE) is an intermediate solution for exploration that consists in adding noise as a function of the state $s\\_{t}$, to the deterministic action $\\mu\\left(\\mathbf{s}\\_{t}\\right)$. At the beginning of an episode, the parameters $\\theta\\_{\\epsilon}$ of that exploration function are drawn from a Gaussian distribution. The resulting action $\\mathbf{a}\\_{t}$ is as follows:\r\n\r\n$$\r\n\\mathbf{a}\\_{t}=\\mu\\left(\\mathbf{s}\\_{t} ; \\theta\\_{\\mu}\\right)+\\epsilon\\left(\\mathbf{s}\\_{t} ; \\theta\\_{\\epsilon}\\right), \\quad \\theta\\_{\\epsilon} \\sim \\mathcal{N}\\left(0, \\sigma^{2}\\right)\r\n$$\r\n\r\nThis episode-based exploration is smoother and more consistent than the unstructured step-based exploration. Thus, during one episode, instead of oscillating around a mean value, the action a for a given state $s$ will be the same.\r\n\r\nIn the case of a linear exploration function $\\epsilon\\left(\\mathbf{s} ; \\theta\\_{\\epsilon}\\right)=\\theta\\_{\\epsilon} \\mathbf{s}$, by operation on Gaussian distributions, Rückstieß et al. 
show that the action element $\\mathbf{a}\\_{j}$ is normally distributed:\r\n\r\n$$\r\n\\pi\\_{j}\\left(\\mathbf{a}\\_{j} \\mid \\mathbf{s}\\right) \\sim \\mathcal{N}\\left(\\mu\\_{j}(\\mathbf{s}), \\hat{\\sigma}\\_{j}^{2}\\right)\r\n$$\r\n\r\nwhere $\\hat{\\sigma}$ is a diagonal matrix with elements $\\hat{\\sigma}\\_{j}=\\sqrt{\\sum\\_{i}\\left(\\sigma\\_{i j} \\mathbf{s}\\_{i}\\right)^{2}}$.\r\n\r\nBecause we know the policy distribution, we can obtain the derivative of the log-likelihood $\\log \\pi(\\mathbf{a} \\mid \\mathbf{s})$ with respect to the variance $\\sigma$:\r\n\r\n$$\r\n\\frac{\\partial \\log \\pi(\\mathbf{a} \\mid \\mathbf{s})}{\\partial \\sigma\\_{i j}}=\\frac{\\left(\\mathbf{a}\\_{j}-\\mu\\_{j}\\right)^{2}-\\hat{\\sigma}\\_{j}^{2}}{\\hat{\\sigma}\\_{j}^{3}} \\frac{\\mathbf{s}\\_{i}^{2} \\sigma\\_{i j}}{\\hat{\\sigma}\\_{j}}\r\n$$\r\n\r\nThis can be easily plugged into the likelihood ratio gradient estimator, which allows $\\sigma$ to be adapted during training. SDE is therefore compatible with standard policy gradient methods, while addressing most shortcomings of the unstructured exploration.\r\n\r\nFor gSDE, two improvements are suggested:\r\n\r\n1. We sample the parameters $\\theta\\_{\\epsilon}$ of the exploration function every $n$ steps instead of every episode.\r\n2. Instead of the state $\\mathbf{s}$, we can in fact use any features. We chose policy features $\\mathbf{z}\\_{\\mu}\\left(\\mathbf{s} ; \\theta\\_{\\mathbf{z}\\_{\\mu}}\\right)$ (the last layer before the deterministic output $\\mu(\\mathbf{s})=\\theta\\_{\\mu} \\mathbf{z}\\_{\\mu}\\left(\\mathbf{s} ; \\theta\\_{\\mathbf{z}\\_{\\mu}}\\right)$) as input to the noise function $\\epsilon\\left(\\mathbf{s} ; \\theta\\_{\\epsilon}\\right)=\\theta\\_{\\epsilon} \\mathbf{z}\\_{\\mu}(\\mathbf{s})$.",
  "title": "Smooth Exploration for Robotic Reinforcement Learning",
  "collection": "Exploration Strategies",
  "area": "Reinforcement Learning"
}
{
  "name": "EdgeFlow",
  "full_name": "EdgeFlow",
  "description": "**EdgeFlow** is an interactive segmentation architecture that fully utilizes interactive information of user clicks with edge-guided flow. Edge guidance is the idea that interactive segmentation improves segmentation masks progressively with user clicks. Based on user clicks, an edge mask scheme is used, which takes the object edges estimated from the previous iteration as prior information, instead of direct mask estimation (if the previous mask is used as input, poor segmentation results could result).\r\n\r\nThe architecture consists of a coarse-to-fine network including CoarseNet and FineNet. For CoarseNet, [HRNet](https://paperswithcode.com/method/hrnet)-18+OCR is utilized as the base segmentation model and the edge-guided flow is appended to deal with interactive information. For FineNet, three [atrous convolution](https://paperswithcode.com/method/dilated-convolution) blocks are utilized to refine the coarse masks.",
  "title": "EdgeFlow: Achieving Practical Interactive Segmentation with Edge-Guided Flow",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "SABL",
  "full_name": "Side-Aware Boundary Localization",
  "description": "**Side-Aware Boundary Localization (SABL)** is a methodology for precise localization in object detection where each side of the bounding box is respectively localized with a dedicated network branch. Empirically, the authors observe that when they manually annotate a bounding box for an object, it is often much easier to align each side of the box to the object boundary than to move the\r\nbox as a whole while tuning the size. Inspired by this observation, in SABL each side of the bounding box is respectively positioned based on its surrounding context. \r\n\r\nAs shown in the Figure, the authors devise a bucketing scheme to improve the localization precision. For each side of a bounding box, this scheme divides the target space into multiple buckets, then determines the bounding box via two steps. Specifically, it first searches for the correct bucket, i.e., the one in which the boundary resides. Leveraging the centerline of the selected buckets as a\r\ncoarse estimate, fine regression is then performed by predicting the offsets. This scheme allows very precise localization even in the presence of displacements with large variance. Moreover, to preserve precisely localized bounding boxes in the non-maximal suppression procedure, the authors also propose to adjust the classification score based on the bucketing confidences, which leads to further performance gains.",
  "title": "Side-Aware Boundary Localization for More Precise Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Attention Dropout",
  "full_name": "Attention Dropout",
  "description": "**Attention Dropout** is a type of [dropout](https://paperswithcode.com/method/dropout) used in attention-based architectures, where elements are randomly dropped out of the [softmax](https://paperswithcode.com/method/softmax) in the attention equation. For example, for scaled-dot product attention, we would drop elements from the first term:\r\n\r\n$$ {\\text{Attention}}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^{T}}{\\sqrt{d_k}}\\right)V $$",
  "title": null,
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Tanh Activation",
  "full_name": "Tanh Activation",
  "description": "**Tanh Activation** is an activation function used for neural networks:\r\n\r\n$$f\\left(x\\right) = \\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$\r\n\r\nHistorically, the tanh function became preferred over the [sigmoid function](https://paperswithcode.com/method/sigmoid-activation) as it gave better performance for multi-layer neural networks. But it did not solve the vanishing gradient problem that sigmoids suffered, which was tackled more effectively with the introduction of [ReLU](https://paperswithcode.com/method/relu) activations.\r\n\r\nImage Source: [Junxi Feng](https://www.researchgate.net/profile/Junxi_Feng)",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "MLFPN",
  "full_name": "MLFPN",
  "description": "**Multi-Level Feature Pyramid Network**, or **MLFPN**, is a feature pyramid block used in object detection models, notably [M2Det](https://paperswithcode.com/method/m2det). We first fuse multi-level features (i.e. multiple layers) extracted by a backbone as a base feature, and then feed it into a block of alternating joint Thinned U-shape Modules ([TUM](https://paperswithcode.com/method/tum)) and Feature Fusion Modules (FFM) to extract more representative, multi-level multi-scale features. Finally, we gather up the feature maps with equivalent scales to construct the final feature pyramid for object detection. Decoder layers that form the final feature pyramid are much deeper than the layers in the backbone, namely, they are more representative. Moreover, each feature map in the final feature pyramid consists of the decoder layers from multiple levels. Hence, the feature pyramid block is called Multi-Level Feature Pyramid Network (MLFPN).",
  "title": "M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "SwiGLU",
  "full_name": "SwiGLU",
  "description": "**SwiGLU** is an activation function which is a variant of [GLU](https://paperswithcode.com/method/glu). The definition is as follows:\r\n\r\n$$ \\text{SwiGLU}\\left(x, W, V, b, c, \\beta\\right) = \\text{Swish}\\_{\\beta}\\left(xW + b\\right) \\otimes \\left(xV + c\\right) $$",
  "title": "GLU Variants Improve Transformer",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "RESCAL",
  "full_name": "RESCAL",
  "description": "RESCAL",
  "title": "A Three-Way Model for Collective Learning on Multi-Relational Data",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "MViT",
  "full_name": "Multiscale Vision Transformer",
  "description": "**Multiscale Vision Transformer**, or **MViT**, is a [transformer](https://paperswithcode.com/method/transformer) architecture for modeling visual data such as images and videos. Unlike conventional transformers, which maintain a constant channel capacity and resolution throughout the network, Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features.",
  "title": "Multiscale Vision Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "RMS Pooling",
  "full_name": "Root-of-Mean-Squared Pooling",
  "description": "**RMS Pooling** is a pooling operation that calculates the square mean root for patches of a feature map, and uses it to create a downsampled (pooled) feature map.  It is usually used after a convolutional layer.\r\n\r\n$$ z_{j} = \\sqrt{\\frac{1}{M}\\sum^{M}_{i=1}u{ij}^{2}} $$",
  "title": null,
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "APA",
  "full_name": "Adaptive Pseudo Augmentation",
  "description": "",
  "title": "Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "TinaFace",
  "full_name": "TinaFace",
  "description": "**TinaFace** is a type of face detection method that is based on generic object detection. It consists of (a) Feature Extractor: [ResNet](https://paperswithcode.com/method/resnet)-50 and 6 level [Feature Pyramid Network](https://www.paperswithcode.com/method/fpn) to extract the multi-scale features of input image; (b) an Inception block to enhance receptive field; (c) Classification Head: 5 layers [FCN](https://paperswithcode.com/method/fcn) for classification of anchors; (d) Regression Head: 5 layers [FCN](https://paperswithcode.com/method/fcn) for regression of anchors to ground-truth objects boxes; (e) IoU Aware Head: a single convolutional layer for IoU prediction.",
  "title": "TinaFace: Strong but Simple Baseline for Face Detection",
  "collection": "Face Detection Models",
  "area": "Computer Vision"
}
{
  "name": "FastSpeech 2",
  "full_name": "FastSpeech 2",
  "description": "**FastSpeech2** is a text-to-speech model that aims to improve upon FastSpeech by better solving the one-to-many mapping problem in TTS, i.e., multiple speech variations corresponding to the same text. It attempts to solve this problem by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, in FastSpeech 2, we extract duration, pitch and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference.\r\n\r\nThe encoder converts the phoneme embedding sequence into the phoneme hidden sequence, and then the variance adaptor adds different variance information such as duration, pitch and energy into the hidden sequence, finally the mel-spectrogram decoder converts the adapted hidden sequence into mel-spectrogram sequence in parallel. FastSpeech 2 uses a feed-forward [Transformer](https://paperswithcode.com/method/transformer) block, which is a stack of [self-attention](https://paperswithcode.com/method/multi-head-attention) and 1D-[convolution](https://paperswithcode.com/method/convolution) as in FastSpeech, as the basic structure for the encoder and mel-spectrogram decoder.",
  "title": "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "Sparse R-CNN",
  "full_name": "Sparse R-CNN",
  "description": "**Sparse R-CNN** is a purely sparse method for object detection in images, without object positional candidates enumerating\r\non all(dense) image grids nor object queries interacting with global(dense) image feature.\r\n\r\nAs shown in the Figure, object candidates are given with a fixed small set of learnable bounding boxes represented by 4-d coordinate. For the example of the COCO dataset, 100 boxes and 400 parameters are needed in total, rather than the predicted ones from hundreds of thousands of candidates in a Region Proposal Network ([RPN](https://paperswithcode.com/method/rpn)). These sparse candidates are used as proposal boxes to extract the feature of Region of Interest (RoI) by [RoIPool](https://paperswithcode.com/method/roi-pooling) or [RoIAlign](https://paperswithcode.com/method/roi-align).",
  "title": "Sparse R-CNN: End-to-End Object Detection with Learnable Proposals",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "ViLT",
  "full_name": "Vision-and-Langauge Transformer",
  "description": "ViLT is a minimal vision-and-language pre-training transformer model where processing of visual inputs is simplified to just the same convolution-free manner that text inputs are processed. The model-specific components of ViLT require less computation than the transformer component for multimodal interactions. ViLTThe model is pre-trained on the following objectives: image text matching, masked language modeling, and word patch alignment.",
  "title": "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Vision-aided GAN",
  "full_name": "Vision-aided GAN",
  "description": "Vision-aided GAN training involves using pretrained computer vision models in an ensemble of discriminators to improve GAN performance. Linear separability between real and fake samples in pretrained model embeddings is used as a measure to choose the most accurate pretrained models for a dataset.",
  "title": "Ensembling Off-the-shelf Models for GAN Training",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Hybrid AWT",
  "full_name": "Hybrid Air-Water Temperature Difference",
  "description": "The hybrid model couples existing macro-meteorological models developed for similar microclimates along with some minimal amount of locally-acquired meteorological and $C_n^2$ data. The hybrid model framework consists of two components, a baseline macro-meteorological model and a machine learning model trained on that baseline macro-meteorological model’s residual error over the locally-acquired training measurements.",
  "title": "Hybrid Optical Turbulence Models Using Machine Learning and Local Measurements",
  "collection": "Non-Parametric Regression",
  "area": "General"
}
{
  "name": "Octave Convolution",
  "full_name": "Octave Convolution",
  "description": "An **Octave Convolution (OctConv)** stores and process feature maps that vary spatially “slower” at a lower spatial resolution reducing both memory and computation cost. It takes in feature maps containing tensors of two frequencies one octave apart, and extracts information directly from the\r\nlow-frequency maps without the need of decoding it back to the high-frequency. The motivation is that in natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures.",
  "title": "Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "HypE",
  "full_name": "Hyperboloid Embeddings",
  "description": "Hyperboloid Embeddings (HypE) is a novel self-supervised dynamic reasoning framework, that utilizes positive first-order existential queries on a KG to learn representations of its entities and relations as hyperboloids in a Poincaré ball. HypE models the positive first-order queries as geometrical translation (t), intersection ($\\cap$), and union ($\\cup$). For the problem of KG reasoning in real-world datasets, the proposed HypE model significantly outperforms the state-of-the art results. HypE is also applied to an anomaly detection task on a popular e-commerce website product taxonomy as well as hierarchically organized web articles and demonstrate significant performance improvements compared to existing baseline methods. Finally, HypE embeddings can also be visualized in a Poincaré ball to clearly interpret and comprehend the representation space.",
  "title": "Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "CSPPeleeNet",
  "full_name": "CSPPeleeNet",
  "description": "**CSPPeleeNet** is a convolutional neural network and object detection backbone  where we apply the Cross Stage Partial Network (CSPNet) approach to [PeleeNet](https://paperswithcode.com/method/peleenet). The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.",
  "title": "CSPNet: A New Backbone that can Enhance Learning Capability of CNN",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Large-scale spectral clustering",
  "full_name": "Large-scale spectral clustering",
  "description": "# [Spectral Clustering](https://paperswithcode.com/method/spectral-clustering)\r\n\r\nSpectral clustering aims to partition the data points into $k$ clusters using the spectrum of the graph Laplacians \r\nGiven a dataset $X$ with $N$ data points, spectral clustering algorithm first constructs similarity matrix ${W}$, where ${w_{ij}}$ indicates the similarity between data points $x_i$ and $x_j$ via a similarity measure metric.\r\n\r\nLet $L=D-W$, where $L$ is called graph Laplacian and ${D}$ is a diagonal matrix with $d_{ii} = \\sum_ {j=1}^n w_{ij}$.\r\nThe objective function of spectral clustering can be formulated based on the graph Laplacian as follow:\r\n\\begin{equation}\r\n  \\label{eq:SC_obj}\r\n  {\\max_{{U}}  \\operatorname{tr}\\left({U}^{T} {L} {U}\\right)}, \\\\ {\\text { s.t. } \\quad {U}^{T} {{U}={I}}},\r\n\\end{equation}\r\nwhere $\\operatorname{tr(\\cdot)}$ denotes the trace norm of a matrix.\r\nThe rows of matrix ${U}$ are the low dimensional embedding of the original data points.\r\nGenerally, spectral clustering computes ${U}$ as the bottom $k$ eigenvectors of ${L}$, and finally applies $k$-means on ${U}$ to obtain the clustering results.\r\n\r\n\r\n# Large-scale Spectral Clustering\r\n\r\nTo capture the relationship between all data points in $X$, an $N\\times N$ similarity matrix is needed to be constructed in conventional spectral clustering, which costs $O(N^2d)$ time and $O(N^2)$ memory and is not feasible for large-scale clustering tasks.\r\nInstead of a full similarity matrix, many accelerated spectral clustering methods are using a similarity sub-matrix to represent each data points by the cross-similarity between data points and a set of representative data points (i.e., landmarks) via some similarity measures, as\r\n\\begin{equation}\r\n    \\label{eq: cross-similarity}\r\n    B = \\Phi(X,R),\r\n\\end{equation}\r\nwhere $R = \\{r_1,r_2,\\dots, r_p \\}$ ($p \\ll N$) is a set of landmarks with the same dimension to $X$, 
$\\Phi(\\cdot)$ denotes a similarity measure, and $B\\in \\mathbb{R}^{N\\times p}$ is the similarity sub-matrix that represents $X \\in \\mathbb{R}^{N\\times d}$ with respect to $R\\in \\mathbb{R}^{p\\times d}$.\r\n\r\nFor large-scale spectral clustering using such a similarity sub-matrix,\r\na symmetric similarity matrix $W$ can be designed as \r\n\\begin{equation}\r\n  \\label{eq: WusedB }\r\n  W=\\left[\\begin{array}{ll}\r\n      \\mathbf{0} & B          \\\\\r\n      B^{T}      & \\mathbf{0}\r\n    \\end{array}\\right].\r\n\\end{equation}\r\nThe size of matrix $W$ is $(N+p)\\times (N+p)$. \r\nTaking advantage of the bipartite structure, some fast eigen-decomposition methods can then be used to obtain the spectral embedding.\r\nFinally, $k$-means is conducted on the embedding to obtain clustering results.\r\n\r\nThe clustering result is directly related to the quality of $B$, which consists of the similarities between data points and landmarks.\r\nThus, the performance of landmark selection is crucial to the clustering result.",
  "title": "Divide-and-conquer based Large-Scale Spectral Clustering",
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "MagFace",
  "full_name": "MagFace",
  "description": "**MagFace** is a category of losses for face recognition that learn a universal feature embedding whose magnitude can measure the quality of a given face. Under the new loss, it can be proven that the magnitude of the feature embedding monotonically increases if the subject is more likely to be recognized. In addition, MagFace introduces an adaptive mechanism to learn a well-structured within-class feature distributions by pulling easy samples to class centers while pushing hard samples away. For face recognition, MagFace helps prevent model overfitting on noisy and low-quality samples by an adaptive mechanism to learn well-structured within-class feature distributions -- by pulling easy samples to class centers while pushing hard samples away.",
  "title": "MagFace: A Universal Representation for Face Recognition and Quality Assessment",
  "collection": "Face Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "BigBird",
  "full_name": "BigBird",
  "description": "**BigBird** is a [Transformer](https://paperswithcode.com/method/transformer) with a sparse attention mechanism that reduces the quadratic dependency of self-attention to linear in the number of tokens. BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.  In particular, BigBird consists of three main parts:\r\n\r\n- A set of $g$ global tokens attending on all parts of the sequence.\r\n- All tokens attending to a set of $w$ local neighboring tokens.\r\n- All tokens attending to a set of $r$ random tokens.\r\n\r\nThis leads to a high performing attention mechanism scaling to much longer sequence lengths (8x).",
  "title": "Big Bird: Transformers for Longer Sequences",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Mask Scoring R-CNN",
  "full_name": "Mask Scoring R-CNN",
  "description": "**Mask Scoring R-CNN** is a Mask RCNN with MaskIoU Head, which takes the instance feature and the predicted mask together as input, and predicts the IoU between input mask and ground truth mask.",
  "title": "Mask Scoring R-CNN",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "SwAV",
  "full_name": "Swapping Assignments between Views",
  "description": "**SwaV**, or **Swapping Assignments Between Views**, is a self-supervised learning approach that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, it simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning. Simply put, SwaV uses a swapped prediction mechanism where we predict the cluster assignment of a view from the representation of another view.",
  "title": "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "MaskFlownet",
  "full_name": "MaskFlownet",
  "description": "**MaskFlownet** is an asymmetric occlusion-aware feature matching module, which can learn a rough occlusion mask that filters useless (occluded) areas immediately after feature warping without any explicit supervision. The learned occlusion mask can be further fed into a subsequent network cascade with dual feature pyramids.",
  "title": "MaskFlownet: Asymmetric Feature Matching with Learnable Occlusion Mask",
  "collection": "Feature Matching",
  "area": "General"
}
{
  "name": "Deformable Position-Sensitive RoI Pooling",
  "full_name": "Deformable Position-Sensitive RoI Pooling",
  "description": "**Deformable Position-Sensitive RoI Pooling** is similar to PS RoI Pooling but it adds an offset to each bin position in the regular bin partition. Offset learning follows the “fully convolutional” spirit. In the top branch, a convolutional layer generates the full spatial resolution offset fields. For each RoI (also for each class), PS RoI pooling is applied on such fields to obtain normalized offsets, which are then transformed to the real offsets, in the same way as in deformable RoI pooling.",
  "title": "Deformable Convolutional Networks",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "ALI",
  "full_name": "Adversarially Learned Inference",
  "description": "**Adversarially Learned Inference (ALI)** is a generative modelling approach that casts the learning of both an inference machine (or encoder) and a deep directed generative model (or decoder) in an GAN-like adversarial framework. A discriminator is trained to discriminate joint samples of the data and the corresponding latent variable from the encoder (or approximate posterior) from joint samples from the decoder while in opposition, the encoder and the decoder are trained together to fool the discriminator. Not is the discriminator asked to distinguish synthetic samples from real data, but it is required it to distinguish between two joint distributions over the data space and the latent variables.\r\n\r\nAn ALI differs from a [GAN](https://paperswithcode.com/method/gan) in two ways:\r\n\r\n- The generator has two components: the encoder, $G\\_{z}\\left(\\mathbf{x}\\right)$, which maps data samples $x$ to $z$-space, and the decoder $G\\_{x}\\left(\\mathbf{z}\\right)$, which maps samples from the prior $p\\left(\\mathbf{z}\\right)$ (a source of noise) to the input space.\r\n- The discriminator is trained to distinguish between joint pairs $\\left(\\mathbf{x}, \\tilde{\\mathbf{z}} = G\\_{\\mathbf{x}}\\left(\\mathbf{x}\\right)\\right)$ and $\\left(\\tilde{\\mathbf{x}} =\r\nG\\_{x}\\left(\\mathbf{z}\\right), \\mathbf{z}\\right)$, as opposed to marginal samples $\\mathbf{x} \\sim q\\left(\\mathbf{x}\\right)$ and $\\tilde{\\mathbf{x}} ∼ p\\left(\\mathbf{x}\\right)$.",
  "title": "Adversarially Learned Inference",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Dilated Sliding Window Attention",
  "full_name": "Dilated Sliding Window Attention",
  "description": "**Dilated Sliding Window Attention** is an attention pattern for attention-based models. It was proposed as part of the [Longformer](https://paperswithcode.com/method/longformer) architecture. It is motivated by the fact that non-sparse attention in the original [Transformer](https://paperswithcode.com/method/transformer) formulation has a [self-attention component](https://paperswithcode.com/method/scaled) with $O\\left(n^{2}\\right)$ time and memory complexity where $n$ is the input sequence length and thus, is not efficient to scale to long inputs. \r\n\r\nCompared to a [Sliding Window Attention](https://paperswithcode.com/method/sliding-window-attention) pattern, we can further increase the receptive field without increasing computation by making the sliding window \"dilated\". This is analogous to [dilated CNNs](https://paperswithcode.com/method/dilated-convolution) where the window has gaps of size dilation $d$. Assuming a fixed $d$ and $w$ for all layers, the receptive field is $l × d × w$, which can reach tens of thousands of tokens even for small values of $d$.",
  "title": "Longformer: The Long-Document Transformer",
  "collection": "Attention Patterns",
  "area": "Natural Language Processing"
}
{
  "name": "Channel Shuffle",
  "full_name": "Channel Shuffle",
  "description": "**Channel Shuffle** is an operation to help information flow across feature channels in convolutional neural networks. It was used as part of the [ShuffleNet](https://paperswithcode.com/method/shufflenet) architecture. \r\n\r\nIf we allow a group [convolution](https://paperswithcode.com/method/convolution) to obtain input data from different groups, the input and output channels will be fully related. Specifically, for the feature map generated from the previous group layer, we can first divide the channels in each group into several subgroups, then feed each group in the next layer with different subgroups. \r\n\r\nThe above can be efficiently and elegantly implemented by a channel shuffle operation: suppose a convolutional layer with $g$ groups whose output has $g \\times n$ channels; we first reshape the output channel dimension into $\\left(g, n\\right)$, transposing and then flattening it back as the input of next layer. Channel shuffle is also differentiable, which means it can be embedded into network structures for end-to-end training.",
  "title": "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices",
  "collection": "Miscellaneous Components",
  "area": "General"
}
{
  "name": "Channel-wise Cross Fusion Transformer",
  "full_name": "Channel-wise Cross Fusion Transformer",
  "description": "**Channel-wise Cross Fusion Transformer** is a module used in the [UCTransNet](https://paperswithcode.com/method/uctransnet) architecture for semantic segmentation. It fuses the multi-scale encoder features with the advantage of the long dependency modeling in the [Transformer](https://paperswithcode.com/method/transformer). The [CCT](https://paperswithcode.com/method/cct) module consists of three steps: multi-scale feature embedding, multi-head [channel-wise cross attention](https://paperswithcode.com/method/channel-wise-cross-attention) and Multi-Layer Perceptron (MLP).",
  "title": "UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer",
  "collection": "Semantic Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "SegNet",
  "full_name": "SegNet",
  "description": "**SegNet** is a semantic segmentation model. This core trainable segmentation architecture consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the\r\nVGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies is in the manner in which the decoder upsamples its lower resolution input feature maps. Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to\r\nperform non-linear upsampling.",
  "title": "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "SCAN-clustering",
  "full_name": "Semantic Clustering by Adopting Nearest Neighbours",
  "description": "SCAN automatically groups images into semantically meaningful clusters when ground-truth annotations are absent. SCAN is a two-step approach where feature learning and clustering are decoupled. First, a self-supervised task is employed to obtain semantically meaningful features. Second, the obtained features are used as a prior in a learnable clustering  approach.\r\n\r\nImage source: [Gansbeke et al.](https://arxiv.org/pdf/2005.12320v2.pdf)",
  "title": "SCAN: Learning to Classify Images without Labels",
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "PnP",
  "full_name": "PnP",
  "description": "**PnP**, or **Poll and Pool**, is sampling module extension for [DETR](https://paperswithcode.com/method/detr)-type architectures that adaptively allocates its computation spatially to be more efficient. Concretely, the PnP module abstracts the image feature map into fine foreground object feature vectors and a small number of coarse background contextual feature vectors. The [transformer](https://paperswithcode.com/method/transformer) models information interaction within the fine-coarse feature space and translates the features into the detection result.",
  "title": "PnP-DETR: Towards Efficient Visual Analysis with Transformers",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "PatchAugment",
  "full_name": "PatchAugment: Local Neighborhood Augmentation in Point Cloud Classification",
  "description": "Recent deep neural network models trained on smaller and less diverse datasets use data augmentation to alleviate limitations such as overfitting, reduced robustness, and lower generalization. Methods using 3D datasets are among the most common to use data augmentation techniques such as random point drop, scaling, translation, rotations, and jittering. However, these data augmentation techniques are fixed and are often applied to the entire object, ignoring the object’s local geometry. Different local neighborhoods on the object surface hold a different amount of geometric complexity. Applying the same data augmentation techniques at the object level is less effective in augmenting local neighborhoods with complex structures. This paper presents PatchAugment, a data augmentation framework to apply different augmentation techniques to the local neighborhoods. Our experimental studies on PointNet++ and DGCNN models demonstrate the effectiveness of PatchAugment on the task of 3D Point Cloud Classification. We evaluated our technique against these models using four benchmark datasets, ModelNet40 (synthetic), ModelNet10 (synthetic), SHREC’16 (synthetic), and ScanObjectNN (real-world).\r\n\r\n[[ICCVW 2021]](https://openaccess.thecvf.com/content/ICCV2021W/DLGC/papers/Sheshappanavar_PatchAugment_Local_Neighborhood_Augmentation_in_Point_Cloud_Classification_ICCVW_2021_paper.pdf) PatchAugment: Local Neighborhood Augmentation in Point Cloud Classification. [[Code]](https://github.com/VimsLab/PatchAugment)",
  "title": null,
  "collection": "Point Cloud Augmentation",
  "area": "Computer Vision"
}
{
  "name": "GAN",
  "full_name": "Generative Adversarial Network",
  "description": "A **GAN**, or **Generative Adversarial Network**, is a generative model that simultaneously trains\r\ntwo models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the\r\nprobability that a sample came from the training data rather than $G$.\r\n\r\nThe training procedure for $G$ is to maximize the probability of $D$ making\r\na mistake. This framework corresponds to a minimax two-player game. In the\r\nspace of arbitrary functions $G$ and $D$, a unique solution exists, with $G$\r\nrecovering the training data distribution and $D$ equal to $\\frac{1}{2}$\r\neverywhere. In the case where $G$ and $D$ are defined by multilayer perceptrons,\r\nthe entire system can be trained with backpropagation. \r\n\r\n(Image Source: [here](http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html))",
  "title": "Generative Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Models Genesis",
  "full_name": "Models Genesis",
  "description": "**Models Genesis**, or **Generic Autodidactic Models**, is a self-supervised approach for learning 3D image representations. The objective of Models Genesis is to learn a common image representation that is transferable and generalizable across diseases, organs, and modalities.  It consists of an encoder-decoder architecture with skip connections in between, and is trained to learn a common image representation by restoring the original sub-volume $x\\_{i}$ (as ground truth) from the transformed one $\\bar{x}\\_{i}$ (as input), in which the reconstruction loss (MSE) is computed between the model prediction $x'\\_{0}$ and ground truth $x\\_{i}$. Once trained, the encoder alone can be fine-tuned for target classification tasks; while the encoder and decoder together can be fine-tuned for target segmentation tasks.",
  "title": "Models Genesis",
  "collection": "3D Representations",
  "area": "Computer Vision"
}
{
  "name": "Reduction-B",
  "full_name": "Reduction-B",
  "description": "**Reduction-B** is an image model block used in the [Inception-v4](https://paperswithcode.com/method/inception-v4) architecture.",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Holographic Reduced Representation",
  "full_name": "Holographic Reduced Representation",
  "description": "**Holographic Reduced Representations** are a simple mechanism to represent an associative array of key-value pairs in a fixed-size vector. Each individual key-value pair is the same size as the entire associative array; the array is represented by the sum of the pairs. Concretely, consider a complex vector key $r = (a\\_{r}[1]e^{iφ\\_{r}[1]}, a\\_{r}[2]e^{iφ\\_{r}[2]}, \\dots)$, which is the same size as the complex vector value x. The pair is \"bound\" together by element-wise complex multiplication, which multiplies the moduli and adds the phases of the elements:\r\n\r\n$$ y = r \\otimes x $$\r\n\r\n$$ y =  \\left(a\\_{r}[1]a\\_{x}[1]e^{i(φ\\_{r}[1]+φ\\_{x}[1])}, a\\_{r}[2]a\\_{x}[2]e^{i(φ\\_{r}[2]+φ\\_{x}[2])}, \\dots\\right) $$\r\n\r\nGiven keys $r\\_{1}$, $r\\_{2}$, $r\\_{3}$ and input vectors $x\\_{1}$, $x\\_{2}$, $x\\_{3}$, the associative array is:\r\n\r\n$$c = r\\_{1} \\otimes x\\_{1} + r\\_{2} \\otimes x\\_{2} + r\\_{3} \\otimes x\\_{3} $$\r\n\r\nwhere we call $c$ a memory trace. Define the key inverse:\r\n\r\n$$ r^{-1} = \\left(a\\_{r}[1]^{−1}e^{−iφ\\_{r}[1]}, a\\_{r}[2]^{−1}e^{−iφ\\_{r}[2]}, \\dots\\right) $$\r\n\r\nTo retrieve the item associated with key $r\\_{k}$, we multiply the memory trace element-wise by the vector $r^{-1}\\_{k}$. For example: \r\n\r\n$$ r\\_{2}^{−1} \\otimes c = r\\_{2}^{-1} \\otimes \\left(r\\_{1} \\otimes x\\_{1} + r\\_{2} \\otimes x\\_{2} + r\\_{3} \\otimes x\\_{3}\\right) $$\r\n\r\n$$ r\\_{2}^{−1} \\otimes c = x\\_{2} + r^{-1}\\_{2} \\otimes \\left(r\\_{1} \\otimes x\\_{1} + r\\_{3} \\otimes x3\\right) $$\r\n\r\n$$ r\\_{2}^{−1} \\otimes c = x\\_{2} + noise $$\r\n\r\nThe product is exactly $x\\_{2}$ together with a noise term. If the phases of the elements of the key vector are randomly distributed, the noise term has zero mean.\r\n\r\nSource: [Associative LSTMs](https://arxiv.org/pdf/1602.03032.pdf)",
  "title": null,
  "collection": "Miscellaneous Components",
  "area": "General"
}
{
  "name": "PolyCAM",
  "full_name": "Poly-CAM",
  "description": "",
  "title": "Poly-CAM: High resolution class activation map for convolutional neural networks",
  "collection": "Explainable CNNs",
  "area": "Computer Vision"
}
{
  "name": "VisTR",
  "full_name": "VisTR",
  "description": "**VisTR** is a [Transformer](https://paperswithcode.com/method/transformer) based video instance segmentation model. It views video instance segmentation as a direct end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR outputs the sequence of masks for each instance in the video in order directly. At the core is a new, effective instance sequence matching and segmentation strategy, which supervises and segments instances at the sequence level as a whole. VisTR frames the instance segmentation and tracking in the same perspective of similarity learning, thus considerably simplifying the overall pipeline and is significantly different from existing approaches.",
  "title": "End-to-End Video Instance Segmentation with Transformers",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Spectral Normalization",
  "full_name": "Spectral Normalization",
  "description": "**Spectral Normalization** is a normalization technique used for generative adversarial networks, used to stabilize training of the discriminator. Spectral normalization has the convenient property that the Lipschitz constant is the only hyper-parameter to be tuned.\r\n\r\nIt controls the Lipschitz constant of the discriminator $f$ by constraining the spectral norm of each layer $g : \\textbf{h}\\_{in} \\rightarrow \\textbf{h}_{out}$. The Lipschitz norm $\\Vert{g}\\Vert\\_{\\text{Lip}}$ is equal to $\\sup\\_{\\textbf{h}}\\sigma\\left(\\nabla{g}\\left(\\textbf{h}\\right)\\right)$, where $\\sigma\\left(a\\right)$ is the spectral norm of the matrix $A$ ($L\\_{2}$ matrix norm of $A$):\r\n\r\n$$ \\sigma\\left(a\\right) = \\max\\_{\\textbf{h}:\\textbf{h}\\neq{0}}\\frac{\\Vert{A\\textbf{h}}\\Vert\\_{2}}{\\Vert\\textbf{h}\\Vert\\_{2}} = \\max\\_{\\Vert\\textbf{h}\\Vert\\_{2}\\leq{1}}{\\Vert{A\\textbf{h}}\\Vert\\_{2}} $$\r\n\r\nwhich is equivalent to the largest singular value of $A$. Therefore for a [linear layer](https://paperswithcode.com/method/linear-layer) $g\\left(\\textbf{h}\\right) = W\\textbf{h}$ the norm is given by $\\Vert{g}\\Vert\\_{\\text{Lip}} = \\sup\\_{\\textbf{h}}\\sigma\\left(\\nabla{g}\\left(\\textbf{h}\\right)\\right) = \\sup\\_{\\textbf{h}}\\sigma\\left(W\\right) = \\sigma\\left(W\\right) $. Spectral normalization normalizes the spectral norm of the weight matrix $W$ so it satisfies the Lipschitz constraint $\\sigma\\left(W\\right) = 1$:\r\n\r\n$$ \\bar{W}\\_{\\text{SN}}\\left(W\\right) = W / \\sigma\\left(W\\right) $$",
  "title": "Spectral Normalization for Generative Adversarial Networks",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "RandAugment",
  "full_name": "RandAugment",
  "description": "**RandAugment** is an automated data augmentation method. The search space for data augmentation has 2 interpretable hyperparameter $N$ and $M$.  $N$ is the number of augmentation transformations to apply sequentially, and $M$ is the magnitude for all the transformations. To reduce the parameter space but still maintain image diversity, learned policies and probabilities for applying each transformation are replaced with a parameter-free procedure of always selecting a transformation with uniform probability $\\frac{1}{K}$. Here $K$ is the number of transformation options. So given $N$ transformations for a training image, RandAugment may thus express $KN$ potential policies.\r\n\r\nTransformations applied include identity transformation, autoContrast, equalize, rotation, solarixation, colorjittering, posterizing, changing contrast, changing brightness, changing sharpness, shear-x, shear-y, translate-x, translate-y.",
  "title": "RandAugment: Practical automated data augmentation with a reduced search space",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "LSH Attention",
  "full_name": "Locality Sensitive Hashing Attention",
  "description": "**LSH Attention**, or **Locality Sensitive Hashing Attention** is a replacement for [dot-product attention](https://paperswithcode.com/method/scaled) with one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\\log L$), where $L$ is the length of the sequence. LSH refers to a family of functions (known as LSH families) to hash data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. It was proposed as part of the [Reformer](https://paperswithcode.com/method/reformer) architecture.",
  "title": "Reformer: The Efficient Transformer",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ShiLU",
  "full_name": "Shifted Rectified Linear Unit",
  "description": "The **Shifted Rectified Linear Unit**, or **ShiLU**, is a modification of **[ReLU](https://paperswithcode.com/method/relu)** activation function that has trainable parameters.\r\n\r\n$$ShiLU(x) = \\alpha ReLU(x) + \\beta$$",
  "title": "Trainable Activations for Image Classification",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "CodeT5",
  "full_name": "CodeT5",
  "description": "**CodeT5** is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based model for code understanding and generation based on the [T5 architecture](https://paperswithcode.com/method/t5). It utilizes an identifier-aware pre-training objective that considers the crucial token type information (identifiers) from code. Specifically, the denoising [Seq2Seq](https://paperswithcode.com/method/seq2seq) objective of T5 is extended with two identifier tagging and prediction tasks to enable the model to better leverage the token type information from programming languages, which are the identifiers assigned by developers. To improve the natural language-programming language alignment, a bimodal dual learning objective is used for a bidirectional conversion between natural language and programming language.",
  "title": "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation",
  "collection": "Code Generation Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "DecomCAM",
  "full_name": "Decomposition-Integration Class Activation Map",
  "description": "DecomCAM decom\u0002poses intermediate activation maps into orthogonal features using singular value decomposition and generates saliency maps by integrating them.",
  "title": "Decom--CAM: Tell Me What You See, In Details! Feature-Level Interpretation via Decomposition Class Activation Map",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "ElasticFace",
  "full_name": "Elastic Margin Loss for Deep Face Recognition",
  "description": "",
  "title": "ElasticFace: Elastic Margin Loss for Deep Face Recognition",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "PAR Transformer",
  "full_name": "PAR Transformer",
  "description": "**PAR Transformer** is a [Transformer](https://paperswithcode.com/methods/category/transformers) model that uses 63% fewer [self-attention blocks](https://paperswithcode.com/method/scaled), replacing them with [feed-forward blocks](https://paperswithcode.com/method/position-wise-feed-forward-layer), while retaining test accuracies. It is based on the [Transformer-XL](https://paperswithcode.com/method/transformer-xl) architecture and uses [neural architecture search](https://paperswithcode.com/task/architecture-search) to find an an efficient pattern of blocks in the transformer architecture.",
  "title": "Pay Attention when Required",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Augmented SBERT",
  "full_name": "Augmented SBERT",
  "description": "**Augmented SBERT** is a data augmentation strategy for pairwise sentence scoring that uses a [BERT](https://paperswithcode.com/method/bert) cross-encoder to improve the performance for the [SBERT](https://paperswithcode.com/method/sbert) bi-encoders. Given a pre-trained, well-performing crossencoder, we sample sentence pairs according to a certain sampling strategy and label these using the cross-encoder. We call these weakly labeled examples the silver dataset and they will be merged with the gold training dataset. We then train the bi-encoder on this extended training dataset.",
  "title": "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
  "collection": "Text Augmentation",
  "area": "Natural Language Processing"
}
{
  "name": "LeNet",
  "full_name": "LeNet",
  "description": "**LeNet** is a classic convolutional neural network employing the use of convolutions, pooling and fully connected layers. It was used for the handwritten digit recognition task with the MNIST dataset. The architectural design served as inspiration for future networks such as [AlexNet](https://paperswithcode.com/method/alexnet) and [VGG](https://paperswithcode.com/method/vgg)..\r\n\r\n[code](https://github.com/Elman295/Paper_with_code/blob/main/LeNet_5_Pytorch.ipynb)",
  "title": null,
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "FastSpeech 2s",
  "full_name": "FastSpeech 2s",
  "description": "**FastSpeech 2s** is a text-to-speech model that abandons mel-spectrograms as intermediate output completely and directly generates speech waveform from text during inference. In other words there is no cascaded mel-spectrogram generation (acoustic model) and waveform generation (vocoder). FastSpeech 2s generates waveform conditioning on intermediate hidden, which makes it more compact in inference by discarding the mel-spectrogram decoder.\r\n\r\nTwo main design changes are made to the waveform decoder. \r\n\r\nFirst, considering that the phase information is difficult to predict using a variance predictor, [adversarial training](https://paperswithcode.com/methods/category/adversarial-training)  is used in the waveform decoder to force it to implicitly recover the phase information by itself. \r\n\r\nSecondly, the mel-spectrogram decoder of [FastSpeech 2](https://paperswithcode.com/method/fastspeech-2) is leveraged, which is trained on the full text sequence to help on the text feature extraction. As shown in the Figure, the waveform decoder is based on the structure of [WaveNet](https://paperswithcode.com/method/wavenet) including non-causal convolutions and gated activation. The waveform decoder takes a sliced hidden sequence corresponding to a short audio clip as input and upsamples it with transposed 1D-convolution to match the length of audio clip. The discriminator in the adversarial training adopts the same structure in Parallel WaveGAN, which consists of ten layers of non-causal [dilated 1-D convolutions](https://paperswithcode.com/method/dilated-convolution) with [leaky ReLU](https://paperswithcode.com/method/leaky-relu) activation function. The waveform decoder is optimized by the multi-resolution STFT loss and the [LSGAN discriminator](https://paperswithcode.com/method/lsgan) loss following Parallel WaveGAN. \r\n\r\nIn inference, the mel-spectrogram decoder is discarded and only the waveform decoder is used to synthesize speech audio.",
  "title": "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "ASAF",
  "full_name": "Adaptive Spline Activation Function",
  "description": "Stefano Guarnieri, Francesco Piazza, and Aurelio Uncini \r\n\"Multilayer Feedforward Networks with Adaptive Spline Activation Function,\" \r\n IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 3, MAY 1999\r\n\r\nAbstract — In this paper, a new adaptive spline activation function neural network (ASNN) is presented. Due to the ASNN’s high representation capabilities, networks with a small number of interconnections can be trained to solve both pattern recognition and data processing real-time problems. The main idea is to use a Catmull–Rom cubic spline as the neuron’s activation function, which ensures a simple structure suitable for both software and hardware implementation. Experimental results demonstrate improvements in terms of generalization capability\r\nand of learning speed in both pattern recognition and data processing tasks.\r\nIndex Terms— Adaptive activation functions, function shape autotuning, generalization, generalized sigmoidal functions, multilayer\r\nperceptron, neural networks, spline neural networks.",
  "title": "Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Depthwise Separable Convolution",
  "full_name": "Depthwise Separable Convolution",
  "description": "While [standard convolution](https://paperswithcode.com/method/convolution) performs the channelwise and spatial-wise computation in one step, **Depthwise Separable Convolution**  splits the computation into two steps: [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution) applies a single convolutional filter per each input channel and [pointwise convolution](https://paperswithcode.com/method/pointwise-convolution) is used to create a linear combination of the output of the depthwise convolution. The comparison of standard convolution and depthwise separable convolution is shown to the right.\r\n\r\nCredit: [Depthwise Convolution Is All You Need for Learning Multiple Visual Domains](https://paperswithcode.com/paper/depthwise-convolution-is-all-you-need-for)",
  "title": "Xception: Deep Learning With Depthwise Separable Convolutions",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "AlphaStar",
  "full_name": "DeepMind AlphaStar",
  "description": "**AlphaStar** is a reinforcement learning agent for tackling the game of Starcraft II. It learns a policy $\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}, z\\right) = P\\left[a\\_{t}\\mid{s\\_{t}}, z\\right]$ using a neural network for parameters $\\theta$ that receives observations $s\\_{t} = \\left(o\\_{1:t}, a\\_{1:t-1}\\right)$ as inputs and chooses actions as outputs. Additionally, the policy conditions on a statistic $z$ that summarizes a strategy sampled from human data such as a build order [1].\r\n\r\nAlphaStar uses numerous types of architecture to incorporate different types of features. Observations of player and enemy units are processed with a [Transformer](https://paperswithcode.com/method/transformer). Scatter connections are used to integrate spatial and non-spatial information. The temporal sequence of observations is processed by a core [LSTM](https://paperswithcode.com/method/lstm). Minimap features are extracted with a Residual Network. To manage the combinatorial action space, the agent uses an autoregressive policy and a recurrent [pointer network](https://paperswithcode.com/method/pointer-net).\r\n\r\nThe agent is trained first with supervised learning from human replays. Parameters are subsequently trained using reinforcement learning that maximizes the win rate against opponents.  The RL algorithm is based on a policy-gradient algorithm similar to actor-critic. Updates are performed asynchronously and off-policy. To deal with this, a combination of $TD\\left(\\lambda\\right)$ and [V-trace](https://paperswithcode.com/method/v-trace) are used, as well as a new self-imitation algorithm (UPGO).\r\n\r\nLastly, to address game-theoretic challenges, AlphaStar is trained with league training to try to approximate a fictitious self-play (FSP) setting which avoids cycles by computing a best response against a uniform mixture of all previous policies. 
The league of potential opponents includes a diverse range of agents, including policies from current and previous agents.\r\n\r\nImage Credit: [Yekun Chai](https://ychai.uk/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/)\r\n\r\n####  References\r\n1. Chai, Yekun. \"AlphaStar: Grandmaster level in StarCraft II Explained.\" (2019).  [https://ychai.uk/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/](https://ychai.uk/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/)\r\n\r\n#### Code Implementation\r\n1. https://github.com/opendilab/DI-star",
  "title": null,
  "collection": "Video Game Models",
  "area": "Reinforcement Learning"
}
{
  "name": "Context Enhancement Module",
  "full_name": "Context Enhancement Module",
  "description": "**Context Enhancement Module (CEM)** is a feature extraction module used in object detection (specifically, [ThunderNet](https://paperswithcode.com/method/thundernet)) which aims to  to enlarge the receptive field. The key idea of CEM is to aggregate multi-scale local context information and global context information to generate more discriminative features. In CEM, the feature maps from three scales are merged: $C\\_{4}$, $C\\_{5}$ and $C\\_{glb}$. $C\\_{glb}$ is the global context feature vector by applying a [global average pooling](https://paperswithcode.com/method/global-average-pooling) on $C\\_{5}$. We then apply a 1 × 1 [convolution](https://paperswithcode.com/method/convolution) on each feature map to squeeze the number of channels to $\\alpha \\times p \\times p = 245$.\r\n\r\nAfterwards, $C\\_{5}$ is upsampled by 2× and $C\\_{glb}$ is broadcast so that the spatial dimensions of the three feature maps are\r\nequal. At last, the three generated feature maps are aggregated. By leveraging both local and global context, CEM effectively enlarges the receptive field and refines the representation ability of the thin feature map. Compared with prior [FPN](https://paperswithcode.com/method/fpn) structures, CEM involves only two 1×1 convolutions and a fc layer.",
  "title": "ThunderNet: Towards Real-time Generic Object Detection",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "RAE",
  "full_name": "Regularized Autoencoders",
  "description": "This method introduces several regularization schemes that can be applied to an Autoencoder. To make the model generative *ex-post* density estimation is proposed and consists in fitting a Mixture of Gaussian distribution on the train data embeddings after the model is trained.",
  "title": "From Variational to Deterministic Autoencoders",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "NN4G",
  "full_name": "Neural network for graphs",
  "description": "NN4G is based on a constructive feedforward architecture with state variables that uses neurons with no feedback connections. The neurons are applied to the input graphs by a general traversal process that relaxes the constraints of previous approaches derived by the causality assumption over hierarchical input data.\r\n\r\nDescription from: [Neural network for graphs: a contextual constructive approach](https://www.meta.org/papers/neural-network-for-graphs-a-contextual/19193509)",
  "title": null,
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "DECA",
  "full_name": "Detailed Expression Capture and Animation",
  "description": "**Detailed Expression Capture and Animation**, or **DECA**, is a model for 3D face reconstruction that is trained to robustly produce a UV displacement map from a low-dimensional latent representation that consists of person-specific detail parameters and generic expression parameters, while a regressor is trained to predict detail, shape, albedo, expression, pose and illumination parameters from a single image. A detail-consistency loss is used to disentangle person-specific details and expression-dependent wrinkles. This disentanglement allows us to synthesize realistic person-specific wrinkles by controlling expression parameters while keeping person-specific details unchanged.",
  "title": "Learning an Animatable Detailed 3D Face Model from In-The-Wild Images",
  "collection": "3D Face Mesh Models",
  "area": "Computer Vision"
}
{
  "name": "rTPNN",
  "full_name": "Recurrent Trend Predictive Neural Network",
  "description": "A neural network model to automatically capture trends in time-series data for improved prediction/forecasting performance",
  "title": "Recurrent Trend Predictive Neural Network for Multi-Sensor Fire Detection",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "SM3",
  "full_name": "SM3",
  "description": "# Memory-Efficient Adaptive Optimization\r\n\r\nSource: https://arxiv.org/abs/1901.11150\r\n\r\nAdaptive gradient-based optimizers such as [AdaGrad](https://paperswithcode.com/method/adagrad) and [Adam](https://paperswithcode.com/method/adam) are among the\r\ndefacto methods of choice in modern machine learning.These methods tune the learning rate for each parameter during the optimization process using cumulative second-order statistics. These methods provide superior convergence properties and are very attractive in large scale applications due to their moderate time and space requirements which are linear in the number of parameters.\r\n\r\n\r\nHowever, the recent advances in natural language processing such as [BERT](https://paperswithcode.com/method/bert) and GPT2 show that models with 10<sup>8</sup> to 10<sup>10</sup> parameters, trained with adaptive optimization methods, achieve state-of-the-art results. In such cases, the memory overhead of the optimizer can restrict the size of the model that can be used as well as the batch size, both of which can have a dramatic effect on the quality of the final model.\r\n\r\n\r\nHere we construct a new adaptive optimization method that retains most of the benefits of standard per-parameter adaptivity while significantly reducing memory overhead.\r\n\r\n\r\nWe observe that in standard neural networks that certain entries of the stochastic gradients have (on average) similar values, and exhibit what we refer to as an activation pattern. For example, in gradients of embedding layers of deep networks, an entire row (or column) is either zero or non-zero. Similarly, in intermediate layers we often observe that gradients associated with the same unit are of similar order of magnitude. In these cases, a similar phenomenon is observed in the second-order statistics maintained by adaptive methods. 
With this key observation, to reduce the memory overhead of the optimizer our method takes in a cover set of the parameters. Cover sets are typically selected in practice such that parameters in each of the sets have second-order statistics of similar magnitude. Our method is general enough that it can easily be extended to arbitrary cover sets. For parameters of deep networks that are organized as a collection of tensors, we form a cover consisting of slices of codimension one for each tensor. Thus, for an m x n parameter matrix, the cover consists of rows and columns of the matrix. The memory requirements therefore drop from m x n to merely m + n. For a parameter tensor of rank p, with dimensions n<sub>1</sub>, ..., n<sub>p</sub>, the reduction in memory consumption is even more pronounced, dropping from the product of all the dimensions to the sum of all the dimensions. This virtually eliminates the memory overhead associated with maintaining the adaptive learning rates!\r\n\r\nAnother practical aspect worthy of note is that our method does not require an external hand-engineered learning rate decay schedule but instead relies on the per-parameter adaptivity that is natural to its update rule, which makes it easier to tune. We provide details in the supplementary section of the paper.\r\n\r\n## Advice on using SM3 on your model\r\n\r\n### Learning rate warm-up:\r\n\r\n```python\r\nlearning_rate = lr_constant * tf.minimum(1.0, (global_step / warm_up_step) ** p)\r\n```\r\n\r\n* p = 1, linear ramp up of learning rate.\r\n* p = 2, quadratic ramp up of learning rate [preferred].\r\n\r\nWe typically set `warm_up_step` as 5% of overall steps. Initially, the norm of the preconditioned gradient is much larger than the norm of the weights. Learning rate warmup allows us to heuristically fix this scale mismatch.\r\n\r\n### Learning rate decay:\r\n\r\nWe make use of accumulated gradient squares for the decay. 
This means that each coordinate gets its own natural decay based on the scales of the gradients over time. Hence, users need not put in an external learning rate decay schedule. Moreover, we found in our experiments with translation and language models that this approach is superior to a hand-tuned learning rate decay schedule, which is typically combined with exponential moving averages of the gradient squares.\r\n\r\nThat said, if users prefer exponential moving averages over the standard accumulated gradient squares, it is easy to modify the optimizer implementation accordingly.\r\n\r\nFor rank > 1:\r\n\r\n| from | to |\r\n|------|----|\r\n| current_accumulator += grad * grad | current_accumulator = beta * current_accumulator + (1-beta) * grad * grad |\r\n\r\nFor rank <= 1:\r\n\r\n| from | to |\r\n|------|----|\r\n| current_accumulator = tf.assign_add(accumulator, grad * grad) | current_accumulator = tf.assign(accumulator, beta * accumulator + (1-beta) * (grad * grad)) |\r\n\r\n### [Polyak averaging](https://paperswithcode.com/method/polyak-averaging) of parameters: \r\nIt's useful to run [polyak averaging](https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage) of the parameters. These parameters are then used in inference / serving. Using the averaged parameters instead of the last iterate typically improves the overall performance of the model.\r\n\r\nAn **alternative** to polyak averaging which does not make use of extra memory is to decay the learning rate from the constant to zero for the last 10% of the steps of your entire training run; we term this the **cool-down** phase of the model. 
As training makes smaller and smaller steps the final iterate can be thought of as an average iterate.",
  "title": "Memory Efficient Adaptive Optimization",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Dueling Network",
  "full_name": "Dueling Network",
  "description": "A **Dueling Network** is a type of Q-Network that has two streams to separately estimate (scalar) state-value and the advantages for each action. Both streams share a common convolutional feature learning module. The two streams are combined via a special aggregating layer to produce an\r\nestimate of the state-action value function Q as shown in the figure to the right.\r\n\r\nThe last module uses the following mapping:\r\n\r\n$$ Q\\left(s, a, \\theta, \\alpha, \\beta\\right) =V\\left(s, \\theta, \\beta\\right) + \\left(A\\left(s, a, \\theta, \\alpha\\right) - \\frac{1}{|\\mathcal{A}|}\\sum\\_{a'}A\\left(s, a'; \\theta, \\alpha\\right)\\right) $$\r\n\r\nThis formulation is chosen for identifiability so that the advantage function has zero advantage for the chosen action, but instead of a maximum we use an average operator to increase the stability of the optimization.",
  "title": "Dueling Network Architectures for Deep Reinforcement Learning",
  "collection": "Q-Learning Networks",
  "area": "Reinforcement Learning"
}
{
  "name": "Chinchilla",
  "full_name": "Chinchilla",
  "description": "Chinchilla is a 70B parameters model trained as a compute-optimal model with 1.4 trillion tokens. Findings suggest that these types of models are trained optimally by equally scaling both model size and training tokens. It uses the same compute budget as Gopher but with 4x more training data. Chinchilla and Gopher are trained for the same number of FLOPs. It is trained using [MassiveText](/dataset/massivetext) using a slightly modified SentencePiece tokenizer. More architectural details in the paper.",
  "title": "Training Compute-Optimal Large Language Models",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Attention Feature Filters",
  "full_name": "Attention Feature Filters",
  "description": "An attention mechanism for content-based filtering of multi-level features. For example, recurrent features obtained by forward and backward passes of a bidirectional RNN block can be combined using attention feature filters, with unprocessed input features/embeddings as queries and recurrent features as keys/values.",
  "title": "NeuriCam: Key-Frame Video Super-Resolution and Colorization for IoT Cameras",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Ternary Weight Splitting",
  "full_name": "Ternary Weight Splitting",
  "description": "**Ternary Weight Splitting** is a ternarization approach used in [BinaryBERT](https://www.paperswithcode.com/method/binarybert) that exploits the flatness of ternary loss landscape as the optimization proxy of the binary model. We first train the half-sized ternary BERT to convergence, and then split both the latent full-precision weight $\\mathbf{w}^{t}$ and quantized $\\hat{\\mathbf{w}}^{t}$ to their binary counterparts $\\mathbf{w}\\_{1}^{b}, \\mathbf{w}\\_{2}^{b}$ and $\\hat{\\mathbf{w}}\\_{1}^{b}, \\hat{\\mathbf{w}}\\_{2}^{b}$ via the TWS operator. To inherit the performance of the ternary model after splitting, the TWS operator requires the splitting equivalency (i.e., the same output given the same input):\r\n\r\n$$\r\n\\mathbf{w}^{t}=\\mathbf{w}\\_{1}^{b}+\\mathbf{w}\\_{2}^{b}, \\quad \\hat{\\mathbf{w}}^{t}=\\hat{\\mathbf{w}}\\_{1}^{b}+\\hat{\\mathbf{w}}\\_{2}^{b}\r\n$$\r\n\r\nWhile solution to the above equation is not unique, we constrain the latent full-precision weights after splitting $\\mathbf{w}\\_{1}^{b}, \\mathbf{w}\\_{2}^{b}$ to satisfy $\\mathbf{w}^{t}=\\mathbf{w}\\_{1}^{b}+\\mathbf{w}\\_{2}^{b}$. See the paper for more details.",
  "title": "BinaryBERT: Pushing the Limit of BERT Quantization",
  "collection": "Ternarization",
  "area": "General"
}
{
  "name": "FCOS",
  "full_name": "FCOS",
  "description": "**FCOS** is an anchor-box free, proposal free, single-stage object detection model. By eliminating the predefined set of anchor boxes, FCOS avoids computation related to anchor boxes such as calculating overlapping during training. It also avoids all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance.",
  "title": "FCOS: Fully Convolutional One-Stage Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "LMOT",
  "full_name": "LMOT: Efficient Light-Weight Detection and Tracking in Crowds",
  "description": "Rana Mostafa, Hoda Baraka and AbdelMoniem Bayoumi\r\n\r\n**LMOT**, i.e., Light-weight Multi-Object Tracker,  performs joint pedestrian detection and tracking. LMOT introduces a simplified DLA-34 encoder network to extract detection features for the current image that are computationally efficient. Furthermore, we generate efficient tracking features using a linear transformer for the prior image frame and its corresponding detection heatmap. After that, LMOT fuses both detection and tracking feature maps in a multi-layer scheme and performs a two-stage online data association relying on the Kalman filter to generate tracklets. We evaluated our model on the challenging real-world MOT16/17/20 datasets, showing LMOT significantly outperforms the state-of-the-art trackers concerning runtime while maintaining high robustness. LMOT is approximately ten times faster than state-of-the-art trackers while being only 3.8% behind in performance accuracy on average leading to a much computationally lighter model.\r\n\r\nCode: https://github.com/RanaMostafaAbdElMohsen/LMOT\r\nPaper: https://doi.org/10.1109/ACCESS.2022.3197157",
  "title": null,
  "collection": "Multi-Object Tracking Models",
  "area": "Computer Vision"
}
{
  "name": "Siren",
  "full_name": "Sinusoidal Representation Network",
  "description": "**Siren**, or **Sinusoidal Representation Network**, is a periodic activation function for implicit neural representations. Specifically it uses the sine as a periodic activation function:\r\n\r\n$$ \\Phi\\left(x\\right) = \\textbf{W}\\_{n}\\left(\\phi\\_{n-1} \\circ \\phi\\_{n-2} \\circ \\dots \\circ \\phi\\_{0} \\right) $$",
  "title": "Implicit Neural Representations with Periodic Activation Functions",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "BASE",
  "full_name": "Balanced Selection",
  "description": "",
  "title": "Active Learning at the ImageNet Scale",
  "collection": "Active Learning",
  "area": "General"
}
{
  "name": "SimpleNet",
  "full_name": "SimpleNet",
  "description": "**SimpleNet** is a convolutional neural network with 13 layers. The network employs a homogeneous design utilizing 3 × 3 kernels for convolutional layer and 2 × 2 kernels for pooling operations. The only layers which do not use 3 × 3 kernels are 11th and 12th layers, these layers, utilize 1 × 1 convolutional kernels. Feature-map down-sampling is carried out using nonoverlaping 2 × 2 max-pooling. In order to cope with the problem of vanishing gradient and also over-fitting, SimpleNet also uses batch-normalization with moving average fraction of 0.95 before any [ReLU](https://paperswithcode.com/method/relu) non-linearity.",
  "title": "Lets keep it simple, Using simple architectures to outperform deeper and more complex architectures",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "MelGAN Residual Block",
  "full_name": "MelGAN Residual Block",
  "description": "The **MelGAN Residual Block** is a convolutional [residual block](https://paperswithcode.com/method/residual-block) used in the [MelGAN](https://paperswithcode.com/method/melgan) generative audio architecture. It employs residual connections with dilated convolutions. Dilations are used so that temporally far output activations of each subsequent layer has significant overlapping inputs. Receptive field of a stack of [dilated convolution](https://paperswithcode.com/method/dilated-convolution) layers increases exponentially with the number of layers. Incorporating these into the MelGAN generator allows us to efficiently increase the induced receptive fields of each output time-step. This effectively implies larger overlap in the induced receptive field of far apart time-steps, leading to better long range correlation.",
  "title": "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "NeRF",
  "full_name": "Neural Radiance Field",
  "description": "NeRF represents a scene with learned, continuous volumetric radiance field $F_\\theta$ defined over a bounded 3D volume. In a NeRF, $F_\\theta$ is a multilayer perceptron (MLP) that takes as input a 3D position $x = (x, y, z)$ and unit-norm viewing direction $d = (dx, dy, dz)$, and produces as output a density $\\sigma$ and color $c = (r, g, b)$. The weights of the multilayer perceptron that parameterize $F_\\theta$ are optimized so as to encode the radiance field of the scene. Volume rendering is used to compute the color of a single pixel.",
  "title": "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis",
  "collection": "3D Representations",
  "area": "Computer Vision"
}
{
  "name": "Hierarchical Feature Fusion",
  "full_name": "Hierarchical Feature Fusion",
  "description": "**Hierarchical Feature Fusion (HFF)** is a feature fusion method employed in [ESP](https://paperswithcode.com/method/esp) and [EESP](https://paperswithcode.com/method/eesp) image model blocks for degridding. In the ESP module, concatenating the outputs of dilated convolutions gives the ESP module a large effective receptive field, but it introduces unwanted checkerboard or gridding artifacts. To address the gridding artifact in ESP, the feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them (HFF). This solution is simple and effective and does not increase the complexity of the ESP module.",
  "title": "ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation",
  "collection": "Degridding",
  "area": "Computer Vision"
}
{
  "name": "Cascade R-CNN",
  "full_name": "Cascade R-CNN",
  "description": "**Cascade R-CNN** is an object detection architecture that seeks to address problems with degrading performance with increased IoU thresholds (due to overfitting during training and inference-time mismatch between IoUs for which detector is optimal and the inputs). It is a multi-stage extension of the [R-CNN](https://paperswithcode.com/method/r-cnn), where detector stages deeper into the cascade are sequentially more selective against close false positives. The cascade of R-CNN stages are trained sequentially, using the output of one stage to train the next. This is motivated by the observation that the output IoU of a regressor is almost invariably better than the input IoU. \r\n\r\nCascade R-CNN does not aim to mine hard negatives. Instead, by adjusting bounding boxes, each stage aims to find a good set of close false positives for training the next stage. When operating in this manner, a sequence of detectors adapted to increasingly higher IoUs can beat the overfitting problem, and thus be effectively trained. At inference, the same cascade procedure is applied. The progressively improved hypotheses are better matched to the increasing detector quality at each stage.",
  "title": "Cascade R-CNN: Delving into High Quality Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "BasicVSR",
  "full_name": "BasicVSR",
  "description": "**BasicVSR** is a video super-resolution pipeline including optical flow and [residual blocks](https://paperswithcode.com/method/residual-connection). It adopts a typical bidirectional recurrent network. The upsampling module $U$ contains multiple [pixel-shuffle](https://paperswithcode.com/method/pixelshuffle) and convolutions. In the Figure, red and blue colors represent the backward and forward propagations, respectively.  The propagation branches contain only generic components. $S, W$, and $R$ refer to the flow estimation module, spatial warping module, and residual blocks, respectively.",
  "title": "BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond",
  "collection": "Video Super-Resolution Models",
  "area": "Computer Vision"
}
{
  "name": "Inception Module",
  "full_name": "Inception Module",
  "description": "An **Inception Module** is an image model block that aims to approximate an optimal local sparse structure in a CNN. Put simply, it allows for us to use multiple types of filter size, instead of being restricted to a single filter size, in a single image block, which we then concatenate and pass onto the next layer.",
  "title": "Going Deeper with Convolutions",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Virtual Batch Normalization",
  "full_name": "Virtual Batch Normalization",
  "description": "**Virtual Batch Normalization** is a normalization method used for training generative adversarial networks that extends batch normalization. Regular [batch normalization](https://paperswithcode.com/method/batch-normalization) causes the output of a neural network for an input example $\\mathbf{x}$ to be highly dependent on several other inputs $\\mathbf{x}'$ in the same minibatch. To avoid this problem in virtual batch normalization (VBN), each example $\\mathbf{x}$ is normalized based on the statistics collected on a reference batch of examples that are chosen once and fixed at the start of training, and on $\\mathbf{x}$ itself. The reference batch is normalized using only its own statistics. VBN is computationally expensive because it requires running forward propagation on two minibatches of data, so the authors use it only in the generator network.",
  "title": "Improved Techniques for Training GANs",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Prioritized Sweeping",
  "full_name": "Prioritized Sweeping",
  "description": "**Prioritized Sweeping** is a reinforcement learning technique for model-based algorithms that prioritizes updates according to a measure of urgency, and performs these updates first. A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change. When the top pair in the queue is updated, the effect on each of its predecessor pairs is computed. If the effect is greater than some threshold, then the pair is inserted in the queue with the new priority.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition",
  "title": null,
  "collection": "Efficient Planning",
  "area": "Reinforcement Learning"
}
{
  "name": "G3D",
  "full_name": "G3D",
  "description": "**G3D** is a unified spatial-temporal graph convolutional operator that directly models cross-spacetime joint dependencies. It leverages dense cross-spacetime edges as skip connections for direct information propagation across the 3D spatial-temporal graph.",
  "title": "Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition",
  "collection": "Action Recognition Blocks",
  "area": "General"
}
{
  "name": "ASVI",
  "full_name": "Automatic Structured Variational Inference",
  "description": "**Automatic Structured Variational Inference (ASVI)** is a fully automated method for constructing structured variational families, inspired by the closed-form update in conjugate Bayesian models. These convex-update families incorporate the forward pass of the input probabilistic program and can therefore capture complex statistical dependencies. Convex-update families have the same space and time complexity as the input probabilistic program and are therefore tractable for a very large family of models including both continuous and discrete variables.",
  "title": "Automatic structured variational inference",
  "collection": "Variational Optimization",
  "area": "General"
}
{
  "name": "DASPP",
  "full_name": "Deeper Atrous Spatial Pyramid Pooling",
  "description": "DASPP is a deeper version of the [ASPP](https://paperswithcode.com/method/aspp) module (the latter from [DeepLabv3](https://paperswithcode.com/method/deeplabv3)) that adds standard 3 × 3 [convolution](https://paperswithcode.com/method/convolution) after 3 × 3 dilated convolutions to refine the features and also fusing the input and the output of the DASPP module via short [residual connection](https://paperswithcode.com/method/residual-connection). Also, the number of convolution filters of ASPP is reduced from 255 to 96 to gain computational performance.",
  "title": "LiteSeg: A Novel Lightweight ConvNet for Semantic Segmentation",
  "collection": "Semantic Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "CSPDenseNet-Elastic",
  "full_name": "CSPDenseNet-Elastic",
  "description": "**CSPDenseNet-Elastic** is a convolutional neural network and object detection backbone where we apply the Cross Stage Partial Network (CSPNet) approach to [DenseNet-Elastic](https://paperswithcode.com/method/densenet-elastic). The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.",
  "title": "CSPNet: A New Backbone that can Enhance Learning Capability of CNN",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Channel-wise Cross Attention",
  "full_name": "Channel-wise Cross Attention",
  "description": "**Channel-wise Cross Attention** is a module for semantic segmentation used in the [UCTransNet](https://paperswithcode.com/method/uctransnet) architecture. It is used to fuse features of inconsistent semantics between the Channel [Transformer](https://paperswithcode.com/method/transformer) and [U-Net](https://paperswithcode.com/method/u-net) decoder. It guides the channel and information filtration of the Transformer features and eliminates the ambiguity with the decoder features.\r\n\r\nMathematically, we take the $i$-th level Transformer output $\\mathbf{O\\_{i}} \\in \\mathbb{R}^{C×H×W}$ and i-th level decoder feature map $\\mathbf{D\\_{i}} \\in \\mathbb{R}^{C×H×W}$ as the inputs of Channel-wise Cross Attention. Spatial squeeze is performed by a [global average pooling](https://paperswithcode.com/method/global-average-pooling) (GAP) layer, producing vector $\\mathcal{G}\\left(\\mathbf{X}\\right) \\in \\mathbb{R}^{C×1×1}$ with its $k$th channel $\\mathcal{G}\\left(\\mathbf{X}\\right) = \\frac{1}{H×W}\\sum^{H}\\_{i=1}\\sum^{W}\\_{j=1}\\mathbf{X}^{k}\\left(i, j\\right)$. We use this operation to embed the global spatial information and then generate the attention mask:\r\n\r\n$$ \\mathbf{M}\\_{i} = \\mathbf{L}\\_{1} \\cdot \\mathcal{G}\\left(\\mathbf{O\\_{i}}\\right) + \\mathbf{L}\\_{2} \\cdot \\mathcal{G}\\left(\\mathbf{D}\\_{i}\\right) $$\r\n\r\nwhere $\\mathbf{L}\\_{1} \\in \\mathbb{R}^{C×C}$ and $\\mathbf{L}\\_{2} \\in \\mathbb{R}^{C×C}$ and being weights of two Linear layers and the [ReLU](https://paperswithcode.com/method/relu) operator $\\delta\\left(\\cdot\\right)$. This operation in the equation above encodes the channel-wise dependencies. 
Following [ECA-Net](https://paperswithcode.com/method/eca-net) which empirically showed avoiding dimensionality reduction is important for learning channel attention, the authors use a single [Linear layer](https://paperswithcode.com/method/linear-layer) and sigmoid function to build the channel attention map. The resultant vector is used to recalibrate or excite $\\mathbf{O\\_{i}}$ to $\\mathbf{\\bar{O}\\_{i}} = \\sigma\\left(\\mathbf{M\\_{i}}\\right) \\cdot \\mathbf{O\\_{i}}$, where the activation $\\sigma\\left(\\mathbf{M\\_{i}}\\right)$ indicates the importance of each channel. Finally, the masked $\\mathbf{\\bar{O}}\\_{i}$ is concatenated with the up-sampled features of the $i$-th level decoder.",
  "title": "UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "FixRes",
  "full_name": "FixRes",
  "description": "**FixRes** is an image scaling strategy that seeks to optimize classifier performance. It is motivated by the observation that data augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time: in fact, a lower train resolution improves the classification at test time! FixRes is a simple strategy to optimize the classifier performance, that employs different train and test resolutions. The calibrations are: (a) calibrating the object sizes by adjusting the crop size and (b) adjusting statistics before spatial pooling.",
  "title": "Fixing the train-test resolution discrepancy",
  "collection": "Image Scaling Strategies",
  "area": "Computer Vision"
}
{
  "name": "MAML",
  "full_name": "Model-Agnostic Meta-Learning",
  "description": "**MAML**, or **Model-Agnostic Meta-Learning**, is a model and task-agnostic algorithm for meta-learning that trains a model’s parameters such that a small number of gradient updates will lead to fast learning on a new task.\r\n\r\nConsider a model represented by a parametrized function $f\\_{\\theta}$ with parameters $\\theta$. When adapting to a new task $\\mathcal{T}\\_{i}$, the model’s parameters $\\theta$ become $\\theta'\\_{i}$. With MAML, the updated parameter vector $\\theta'\\_{i}$ is computed using one or more gradient descent updates on task $\\mathcal{T}\\_{i}$. For example, when using one gradient update,\r\n\r\n$$ \\theta'\\_{i} = \\theta - \\alpha\\nabla\\_{\\theta}\\mathcal{L}\\_{\\mathcal{T}\\_{i}}\\left(f\\_{\\theta}\\right) $$\r\n\r\nThe step size $\\alpha$ may be fixed as a hyperparameter or metalearned. The model parameters are trained by optimizing for the performance of $f\\_{\\theta'\\_{i}}$ with respect to $\\theta$ across tasks sampled from $p\\left(\\mathcal{T}\\_{i}\\right)$. More concretely the meta-objective is as follows:\r\n\r\n$$ \\min\\_{\\theta} \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta'\\_{i}}\\right) = \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta - \\alpha\\nabla\\_{\\theta}\\mathcal{L}\\_{\\mathcal{T}\\_{i}}\\left(f\\_{\\theta}\\right)}\\right) $$\r\n\r\nNote that the meta-optimization is performed over the model parameters $\\theta$, whereas the objective is computed using the updated model parameters $\\theta'$. In effect MAML aims to optimize the model parameters such that one or a small number of gradient steps on a new task will produce maximally effective behavior on that task. 
The meta-optimization across tasks is performed via stochastic gradient descent ([SGD](https://paperswithcode.com/method/sgd)), such that the model parameters $\\theta$ are updated as follows:\r\n\r\n$$ \\theta \\leftarrow \\theta - \\beta\\nabla\\_{\\theta} \\sum\\_{\\mathcal{T}\\_{i} \\sim p\\left(\\mathcal{T}\\right)} \\mathcal{L}\\_{\\mathcal{T\\_{i}}}\\left(f\\_{\\theta'\\_{i}}\\right)$$\r\n\r\nwhere $\\beta$ is the meta step size.",
  "title": "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks",
  "collection": "Meta-Learning Algorithms",
  "area": "General"
}
{
  "name": "VoVNetV2",
  "full_name": "VoVNetV2",
  "description": "**VoVNetV2** is a convolutional neural network that improves upon [VoVNet](https://paperswithcode.com/method/vovnet) with two effective strategies: (1) [residual connection](https://paperswithcode.com/method/residual-connection) for alleviating the optimization problem of larger VoVNets and (2) effective Squeeze-Excitation (eSE) dealing with the channel information loss problem of the original squeeze-and-excitation module.",
  "title": "CenterMask : Real-Time Anchor-Free Instance Segmentation",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "HOC",
  "full_name": "High-Order Consensuses",
  "description": "",
  "title": "Clusterability as an Alternative to Anchor Points When Learning with Noisy Labels",
  "collection": "Value Function Estimation",
  "area": "Reinforcement Learning"
}
{
  "name": "Harm-Net",
  "full_name": "Harm-Net",
  "description": "A **Harmonic Network**, or **Harm-Net**, is a type of convolutional neural network that replaces convolutional layers with \"harmonic blocks\" that use [Discrete Cosine Transform](https://paperswithcode.com/method/discrete-cosine-transform) (DCT) filters. These blocks can be useful in  truncating high-frequency information (possible due to the redundancies in the spectral domain).",
  "title": "Harmonic Convolutional Networks based on Discrete Cosine Transform",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "BiGAN",
  "full_name": "Bidirectional GAN",
  "description": "A **BiGAN**, or **Bidirectional GAN**, is a type of generative adversarial network where the generator  not only maps latent samples to generated data, but also has an inverse mapping from data to the latent representation. The motivation is to make a type of GAN that can learn rich representations for us in applications like unsupervised learning.\r\n\r\nIn addition to the generator $G$ from the standard [GAN](https://paperswithcode.com/method/gan) framework, BiGAN includes an encoder $E$ which maps data $\\mathbf{x}$ to latent representations $\\mathbf{z}$. The BiGAN discriminator $D$ discriminates not only in data space ($\\mathbf{x}$ versus $G\\left(\\mathbf{z}\\right)$), but jointly in data and latent space (tuples $\\left(\\mathbf{x}, E\\left(\\mathbf{x}\\right)\\right)$ versus $\\left(G\\left(z\\right), z\\right)$), where the latent component is either an encoder output $E\\left(\\mathbf{x}\\right)$ or a generator input $\\mathbf{z}$.",
  "title": "Adversarial Feature Learning",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "pixel2style2pixel",
  "full_name": "pixel2style2pixel",
  "description": "**Pixel2Style2Pixel**, or **pSp**, is an image-to-image translation framework that is based on a novel encoder that directly generates a series of style vectors which are fed into a pretrained [StyleGAN](https://paperswithcode.com/method/stylegan) generator, forming the extended $\\mathcal{W+}$ latent space. Feature maps are first extracted using a standard feature pyramid over a [ResNet](https://paperswithcode.com/method/resnet) backbone. Then, for each of $18$ target styles, a small mapping network is trained to extract the learned styles from the corresponding feature map, where styles $(0-2)$ are generated from the small feature map, $(3-6)$ from the medium feature map, and $(7-18)$ from the largest feature map. The mapping network, map2style, is a small fully convolutional network, which gradually reduces spatial size using a set of 2-strided convolutions followed by [LeakyReLU](https://paperswithcode.com/method/leaky-relu) activations. Each generated 512 vector, is fed into [StyleGAN](https://paperswithcode.com/method/stylegan), starting from its matching affine transformation, $A$.",
  "title": "Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation",
  "collection": "Unpaired Image-to-Image Translation",
  "area": "Computer Vision"
}
{
  "name": "L2M",
  "full_name": "Learning to Match",
  "description": "**L2M** is a learning algorithm that can work for most cross-domain distribution matching tasks. It automatically learns the cross-domain distribution matching without relying on hand-crafted priors on the matching loss. Instead, L2M reduces the inductive bias by using a meta-network to learn the distribution matching loss in a data-driven way.",
  "title": "Learning to Match Distributions for Domain Adaptation",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "IAN",
  "full_name": "Introspective Adversarial Network",
  "description": "The **Introspective Adversarial Network (IAN)** is a hybridization of [GANs](https://paperswithcode.com/method/gan) and [VAEs](https://paperswithcode.com/method/vae) that leverages the power of the adversarial objective while maintaining the VAE’s efficient inference mechanism. It uses the discriminator of the GAN, $D$, as a feature extractor for an inference subnetwork, $E$, which is implemented as a fully-connected layer on top of the final convolutional layer of the discriminator. We infer latent values $Z \\sim E\\left(X\\right) = q\\left(Z\\mid{X}\\right)$ for reconstruction and sample random values $Z \\sim p\\left(Z\\right)$ from a standard normal for random image generation using the generator network, $G$.\r\n\r\nThree distinct loss functions are used:\r\n\r\n- $\\mathcal{L}\\_{img}$, the L1 pixel-wise reconstruction loss, which is preferred to the L2 reconstruction loss for its higher average gradient.\r\n- $\\mathcal{L\\_{feature}}$, the feature-wise reconstruction loss, evaluated as the L2 difference between the original and reconstruction in the space of the hidden layers of the discriminator.\r\n- $\\mathcal{L}\\_{adv}$, the ternary adversarial loss, a modification of the adversarial loss that forces the discriminator to label a sample as real, generated, or reconstructed (as opposed to a binary\r\nreal vs. generated label).\r\n\r\nIncluding the VAE’s KL divergence between the inferred latents $E\\left(X\\right)$ and the prior $p\\left(Z\\right)$, the loss function for the generator and encoder network is thus:\r\n\r\n$$\\mathcal{L}\\_{E, G} = \\lambda\\_{adv}\\mathcal{L}\\_{G\\_{adv}} + \\lambda\\_{img}\\mathcal{L}\\_{img}  + \\lambda\\_{feature}\\mathcal{L}\\_{feature}  + D\\_{KL}\\left(E\\left(X\\right) || p\\left(Z\\right)\\right) $$\r\n\r\nWhere the $\\lambda$ terms weight the relative importance of each loss. We set $\\lambda\\_{img}$ to 3 and leave the other terms at 1. 
The discriminator is updated solely using the ternary adversarial loss. During each training step, the generator produces reconstructions $G\\left(E\\left(X\\right)\\right)$ (using the standard VAE reparameterization trick) from data $X$ and random samples $G\\left(Z\\right)$, while the discriminator observes $X$ as well as the reconstructions and random samples, and both networks are simultaneously updated.",
  "title": "Neural Photo Editing with Introspective Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Fisher-BRC",
  "full_name": "Fisher-BRC",
  "description": "**Fisher-BRC** is an actor critic algorithm for offline reinforcement learning that encourages the learned policy to stay close to the data, namely parameterizing the critic as the $\\log$-behavior-policy, which generated the offline dataset, plus a state-action value offset term, which can be learned using a neural network. Behavior regularization then corresponds to an appropriate regularizer on the offset term. A gradient penalty regularizer is used for the offset term, which is equivalent to Fisher divergence regularization, suggesting connections to the score matching and generative energy-based model literature.",
  "title": "Offline Reinforcement Learning with Fisher Divergence Critic Regularization",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "DeepDrug",
  "full_name": "DeepDrug",
  "description": "**DeepDrug** is a deep learning framework to overcome these shortcomings by using graph convolutional networks to learn the graphical representations of drugs and proteins such as molecular fingerprints and residual structures in order to boost the prediction accuracy.",
  "title": "DeepDrug: A General Graph-Based Deep Learning Framework for Drug Relation Prediction",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "CvT",
  "full_name": "Convolutional Vision Transformer",
  "description": "The **Convolutional vision Transformer (CvT)** is an architecture which incorporates convolutions into the [Transformer](https://paperswithcode.com/method/transformer). The CvT design introduces convolutions to two core sections of the ViT architecture.\r\n\r\nFirst, the Transformers are partitioned into multiple stages that form a hierarchical structure of Transformers. The beginning of each stage consists of a convolutional token embedding that performs an overlapping [convolution](https://paperswithcode.com/method/convolution) operation with stride on a 2D-reshaped token map (i.e., reshaping flattened token sequences back to the spatial grid), followed by [layer normalization](https://paperswithcode.com/method/layer-normalization). This allows the model to not only capture local information, but also progressively decrease the sequence length while simultaneously increasing the dimension of token features across stages, achieving spatial downsampling while concurrently increasing the number of feature maps, as is performed in CNNs. \r\n\r\nSecond, the linear projection prior to every self-attention block in the Transformer module is replaced with a proposed convolutional projection, which employs a s × s depth-wise separable convolution operation on an 2D-reshaped token map. This allows the model to further capture local spatial context and reduce semantic ambiguity in the attention mechanism. It also permits management of computational complexity, as the stride of convolution can be used to subsample the key and value matrices to improve efficiency by 4× or more, with minimal degradation of performance.",
  "title": "CvT: Introducing Convolutions to Vision Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "ALBERT",
  "full_name": "ALBERT",
  "description": "**ALBERT** is a [Transformer](https://paperswithcode.com/method/transformer) architecture based on [BERT](https://paperswithcode.com/method/bert) but with much fewer parameters. It achieves this through two parameter reduction techniques. The first is a factorized embeddings parameterization. By decomposing the large vocabulary embedding matrix into two small matrices, the size of the hidden layers is separated from the size of vocabulary embedding. This makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings. The second technique is cross-layer parameter sharing. This technique prevents the parameter from growing with the depth of the network. \r\n\r\nAdditionally, ALBERT utilises a self-supervised loss for sentence-order prediction (SOP). SOP primary focuses on inter-sentence coherence and is designed to address the ineffectiveness of the next sentence prediction (NSP) loss proposed in the original BERT.",
  "title": "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "WaveGrad DBlock",
  "full_name": "WaveGrad DBlock",
  "description": "**WaveGrad DBlocks** are used to downsample the temporal dimension of noisy waveform in [WaveGrad](https://paperswithcode.com/method/wavegrad). They are similar to UBlocks except that only one [residual block](https://paperswithcode.com/method/residual-block) is included. The dilation factors are 1, 2, 4 in the main branch. Orthogonal initialization is used.",
  "title": "WaveGrad: Estimating Gradients for Waveform Generation",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "Stochastic Dueling Network",
  "full_name": "Stochastic Dueling Network",
  "description": "A **Stochastic Dueling Network**, or **SDN**, is an architecture for learning a value function $V$. The SDN learns both $V$ and $Q$ off-policy while maintaining consistency between the two estimates. At each time step it outputs a stochastic estimate of $Q$ and a deterministic estimate of $V$.",
  "title": "Sample Efficient Actor-Critic with Experience Replay",
  "collection": "Value Function Estimation",
  "area": "Reinforcement Learning"
}
{
  "name": "TResNet",
  "full_name": "TResNet",
  "description": "A **TResNet** is a variant on a [ResNet](https://paperswithcode.com/method/resnet) that aim to boost accuracy while maintaining GPU training and inference efficiency.  They contain several design tricks including a SpaceToDepth stem, [Anti-Alias downsampling](https://paperswithcode.com/method/anti-alias-downsampling), In-Place Activated BatchNorm, Blocks selection and squeeze-and-excitation layers.",
  "title": "TResNet: High Performance GPU-Dedicated Architecture",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "LGCL",
  "full_name": "Learnable graph convolutional layer",
  "description": "Learnable graph convolutional layer (LGCL) automatically selects a fixed number of neighboring nodes for each feature based on value ranking in order to transform graph data into grid-like structures in 1-D format, thereby enabling the use of regular convolutional operations on generic graphs.\r\n\r\nDescription and image from: [Large-Scale Learnable Graph Convolutional Networks](https://arxiv.org/pdf/1808.03965.pdf)",
  "title": "Large-Scale Learnable Graph Convolutional Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "CAMoE",
  "full_name": "CAMoE",
  "description": "**CAMoE** is a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (MoE) for video-text retrieval. The CAMoE employs Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., then align them with the corresponding part of the text. A [Dual Softmax Loss](https://paperswithcode.com/method/dual-softmax-loss) (DSL) is used to avoid the one-way optimum-match which occurs in previous contrastive methods. Introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser to correct the similarity matrix and achieves the dual optimal match.",
  "title": "Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss",
  "collection": "Video-Text Retrieval Models",
  "area": "Computer Vision"
}
{
  "name": "Bort",
  "full_name": "Bort",
  "description": "**Bort** is a parametric architectural variant of the [BERT](https://paperswithcode.com/method/bert) architecture. It extracts an optimal subset of architectural parameters for the BERT architecture through a [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) approach; in particular, a fully polynomial-time approximation scheme (FPTAS). This optimal subset - “Bort” - is demonstrably smaller, having an effective size of $5.5 \\%$ the original BERT-large architecture, and $16\\%$ of the net size. Bort is also able to be pretrained in $288$ GPU hours, which is $1.2\\%$ less than the time required to pretrain the highest-performing BERT parametric architecture variant, RoBERTa-large ([RoBERTa](https://paperswithcode.com/method/roberta)), and about $33\\%",
  "title": "Optimal Subarchitecture Extraction For BERT",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Gradient Normalization",
  "full_name": "Gradient Normalization",
  "description": "**Gradient Normalization** is a normalization method for [Generative Adversarial Networks](https://paperswithcode.com/methods/category/generative-adversarial-networks) to tackle the training instability of generative adversarial networks caused by the sharp gradient space. Unlike existing work such as [gradient penalty](https://paperswithcode.com/method/wgan-gp-loss) and [spectral normalization](https://paperswithcode.com/method/spectral-normalization), the proposed GN only imposes a hard 1-Lipschitz constraint on the discriminator function, which increases the capacity of the network.",
  "title": "Gradient Normalization for Generative Adversarial Networks",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "ControlVAE",
  "full_name": "ControlVAE",
  "description": "**ControlVAE** is a [variational autoencoder](https://paperswithcode.com/method/vae) (VAE) framework that combines the automatic control theory with the basic VAE to stabilize the KL-divergence of VAE models to a specified value. It leverages a non-linear PI controller, a variant of the proportional-integral-derivative (PID) control, to dynamically tune the weight of the KL-divergence term in the evidence lower bound (ELBO) using the output KL-divergence as feedback. This allows for control of the KL-divergence to a desired value (set point), which is effective in avoiding posterior collapse and learning disentangled representations.",
  "title": "ControlVAE: Tuning, Analytical Properties, and Performance Analysis",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Mish",
  "full_name": "Mish",
  "description": "**Mish** is an activation function for neural networks which can be defined as:\r\n\r\n$$ f\\left(x\\right) = x\\cdot\\tanh{\\text{softplus}\\left(x\\right)}$$\r\n\r\nwhere\r\n\r\n$$\\text{softplus}\\left(x\\right) = \\ln\\left(1+e^{x}\\right)$$\r\n\r\n(Compare with functionally similar previously proposed activation functions such as the [GELU](https://paperswithcode.com/method/silu) $x\\Phi(x)$ and the [SiLU](https://paperswithcode.com/method/silu) $x\\sigma(x)$.)",
  "title": "Mish: A Self Regularized Non-Monotonic Activation Function",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "ZCA Whitening",
  "full_name": "ZCA Whitening",
  "description": "**ZCA Whitening** is an image preprocessing method that leads to a transformation of data such that the covariance matrix $\\Sigma$ is the identity matrix, leading to decorrelated features.\r\n\r\nImage Source: [Alex Krizhevsky](http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf)",
  "title": null,
  "collection": "Whitening",
  "area": "Computer Vision"
}
{
  "name": "Shifted Softplus",
  "full_name": "Shifted Softplus",
  "description": "**Shifted Softplus** is an activation function ${\\rm ssp}(x) = \\ln( 0.5 e^{x} + 0.5 )$, which [SchNet](https://paperswithcode.com/method/schnet) employs as non-linearity throughout the network in order to obtain a smooth potential energy surface. The shifting ensures that ${\\rm ssp}(0) = 0$ and improves the convergence of the network. This activation function shows similarity to ELUs, while having infinite order of continuity.",
  "title": "SchNet: A continuous-filter convolutional neural network for modeling quantum interactions",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Latent Diffusion Model",
  "full_name": "Latent Diffusion Model",
  "description": "Diffusion models applied to latent spaces, which are normally built with (Variational) Autoencoders.",
  "title": "High-Resolution Image Synthesis with Latent Diffusion Models",
  "collection": "Dimensionality Reduction",
  "area": "General"
}
{
  "name": "SMOTE",
  "full_name": "Synthetic Minority Over-sampling Technique.",
  "description": "Perhaps the most widely used approach to synthesizing new examples is called the Synthetic Minority Oversampling Technique, or SMOTE for short. This technique was described by Nitesh Chawla, et al. in their 2002 paper named for the technique titled “SMOTE: Synthetic Minority Over-sampling Technique.”\r\n\r\nSMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.",
  "title": "SMOTE: Synthetic Minority Over-sampling Technique",
  "collection": "Downsampling",
  "area": "Computer Vision"
}
{
  "name": "GLU",
  "full_name": "Gated Linear Unit",
  "description": "A **Gated Linear Unit**, or **GLU** computes:\r\n\r\n$$ \\text{GLU}\\left(a, b\\right) = a\\otimes \\sigma\\left(b\\right) $$\r\n\r\nIt is used in natural language processing architectures, for example the [Gated CNN](https://paperswithcode.com/method/gated-convolution-network), because here $b$ is the gate that control what information from $a$ is passed up to the following layer. Intuitively, for a language modeling task, the gating mechanism allows selection of words or features that are important for predicting the next word. The GLU also has non-linear capabilities, but has a linear path for the gradient so diminishes the vanishing gradient problem.",
  "title": "Language Modeling with Gated Convolutional Networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "PowerSGD",
  "full_name": "PowerSGD",
  "description": "**PowerSGD** is a distributed optimization technique that computes a low-rank approximation of the gradient using a generalized power iteration (known as subspace iteration). The approximation is computationally light-weight, avoiding any prohibitively expensive Singular Value Decomposition. To improve the quality of the efficient approximation, the authors warm-start the power iteration by reusing the approximation from the previous optimization step.",
  "title": "PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Average Pooling",
  "full_name": "Average Pooling",
  "description": "**Average Pooling** is a pooling operation that calculates the average value for patches of a feature map, and uses it to create a downsampled (pooled) feature map. It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs. It extracts features more smoothly than [Max Pooling](https://paperswithcode.com/method/max-pooling), whereas max pooling extracts more pronounced features like edges.\r\n\r\nImage Source: [here](https://www.researchgate.net/figure/Illustration-of-Max-Pooling-and-Average-Pooling-Figure-2-above-shows-an-example-of-max_fig2_333593451)",
  "title": null,
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "RetinaNet",
  "full_name": "RetinaNet",
  "description": "**RetinaNet** is a one-stage object detection model that utilizes a [focal loss](https://paperswithcode.com/method/focal-loss) function to address class imbalance during training. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard negative examples. RetinaNet is a single, unified network composed of a *backbone* network and two task-specific *subnetworks*. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-self convolutional network. The first subnet performs convolutional object classification on the backbone's output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design that the authors propose specifically for one-stage, dense detection. \r\n\r\nWe can see the motivation for focal loss by comparing with two-stage object detectors. Here class imbalance is addressed by a two-stage cascade and sampling heuristics. The proposal stage (e.g., [Selective Search](https://paperswithcode.com/method/selective-search), [EdgeBoxes](https://paperswithcode.com/method/edgeboxes), [DeepMask](https://paperswithcode.com/method/deepmask), [RPN](https://paperswithcode.com/method/rpn)) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio, or online hard example mining ([OHEM](https://paperswithcode.com/method/ohem)), are performed to maintain a\r\nmanageable balance between foreground and background.\r\n\r\nIn contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image. To tackle this, RetinaNet uses a focal loss function, a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. 
Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. \r\n\r\nFormally, the focal loss adds a factor $(1 - p\\_{t})^\\gamma$ to the standard cross entropy criterion, where $\\gamma \\ge 0$ is a tunable *focusing* parameter. Setting $\\gamma>0$ reduces the relative loss for well-classified examples ($p\\_{t}>.5$), putting more focus on hard, misclassified examples. \r\n\r\n$$ {\\text{FL}(p\\_{t}) = - (1 - p\\_{t})^\\gamma \\log\\left(p\\_{t}\\right)} $$",
  "title": "Focal Loss for Dense Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Lambda Layer",
  "full_name": "Lambda Layer",
  "description": "**Lambda layers** are a building block for modeling long-range dependencies in data. They consist of long-range interactions between a query and a structured set of context elements at a reduced memory cost. Lambda layers transform each available context into a linear function, termed a lambda, which is then directly applied to the corresponding query. Whereas self-attention defines a similarity kernel between the query and the context elements, a lambda layer instead summarizes contextual information into a fixed-size linear function (i.e. a matrix), thus bypassing the need for memory-intensive attention maps.",
  "title": "LambdaNetworks: Modeling Long-Range Interactions Without Attention",
  "collection": "Long-Range Interaction Layers",
  "area": "General"
}
{
  "name": "GraphSAGE",
  "full_name": "GraphSAGE",
  "description": "GraphSAGE is a general inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data.\r\n\r\nImage from: [Inductive Representation Learning on Large Graphs](https://arxiv.org/pdf/1706.02216v4.pdf)",
  "title": "Inductive Representation Learning on Large Graphs",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "StyleSwin",
  "full_name": "StyleSwin: Transformer-based GAN for High-resolution Image Generation",
  "description": "Despite the tantalizing success in a broad of vision tasks, transformers have not yet demonstrated on-par ability as ConvNets in high-resolution image generative modeling. In this paper, we seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis. To this end, we believe that local attention is crucial to strike the balance between computational efficiency and modeling capacity. Hence, the proposed generator adopts Swin transformer in a style-based architecture. To achieve a larger receptive field, we propose double attention which simultaneously leverages the context of the local and the shifted windows, leading to improved generation quality. Moreover, we show that offering the knowledge of the absolute position that has been lost in window-based transformers greatly benefits the generation quality. The proposed StyleSwin is scalable to high resolutions, with both the coarse geometry and fine structures benefit from the strong expressivity of transformers. However, blocking artifacts occur during high-resolution synthesis because performing the local attention in a block-wise manner may break the spatial coherency. To solve this, we empirically investigate various solutions, among which we find that employing a wavelet discriminator to examine the spectral discrepancy effectively suppresses the artifacts. Extensive experiments show the superiority over prior transformer-based GANs, especially on high resolutions, e.g., 1024x1024. The StyleSwin, without complex training strategies, excels over StyleGAN on CelebA-HQ 1024x1024, and achieves on-par performance on FFHQ 1024x1024, proving the promise of using transformers for high-resolution image generation.",
  "title": null,
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "STD",
  "full_name": "Spatial-Channel Token Distillation",
  "description": "The **Spatial-Channel Token Distillation** method is proposed to improve the spatial and channel mixing from a novel knowledge distillation (KD) perspective. To be specific, we design a special KD mechanism for MLP-like Vision Models called Spatial-channel Token Distillation (STD), which improves the information mixing in both the spatial and channel dimensions of MLP blocks. Instead of modifying the mixing operations themselves, STD adds spatial and channel tokens to image patches. After forward propagation, the tokens are concatenated for distillation with the teachers’ responses as targets. Each token works as an aggregator of its dimension. The objective of them is to encourage each mixing operation to extract maximal task-related information from their specific dimension.",
  "title": "Spatial-Channel Token Distillation for Vision MLPs",
  "collection": "Knowledge Distillation",
  "area": "General"
}
{
  "name": "ERNIE",
  "full_name": "ERNIE",
  "description": "ERNIE is a transformer-based model consisting of two stacked modules: 1) textual encoder and 2) knowledgeable encoder, which is responsible to integrate extra token-oriented knowledge information into textual information. This layer consists of stacked aggregators, designed for encoding both tokens and entities as well as fusing their heterogeneous features. To integrate this layer of enhancing representations via knowledge, a special pre-training task is adopted for ERNIE - it involves randomly masking token-entity alignments and training the model to predict all corresponding entities based on aligned tokens (aka denoising entity auto-encoder).",
  "title": "ERNIE: Enhanced Representation through Knowledge Integration",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "RSE",
  "full_name": "Residual Shuffle-Exchange Network",
  "description": "**Residual Shuffle-Exchange Network** is an efficient alternative to models using an attention mechanism that allows the modelling of long-range dependencies in sequences in O(n log n) time. This model achieved state-of-the-art performance on the MusicNet dataset for music transcription while being able to run inference on a single GPU fast enough to be suitable for real-time audio processing.",
  "title": "Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences",
  "collection": "Music Transcription",
  "area": "Audio"
}
{
  "name": "NAM",
  "full_name": "Neural Additive Model",
  "description": "**Neural Additive Models (NAMs)** make restrictions on the structure of neural networks, which yields a family of models that are inherently interpretable while suffering little loss in prediction accuracy when applied to tabular data. Methodologically, NAMs belong to a larger model family called Generalized Additive Models (GAMs). \r\n\r\nNAMs learn a linear combination of networks that each attend to a single input feature: each $f\\_{i}$ in the traditional GAM formulationis parametrized by a neural network. These networks are trained jointly using backpropagation and can learn arbitrarily complex shape functions. Interpreting NAMs is easy as the impact of a feature on the prediction does not rely on the other features and can be understood by visualizing its corresponding shape function (e.g., plotting $f\\_{i}\\left(x\\_{i}\\right)$ vs. $x\\_{i}$).",
  "title": "Neural Additive Models: Interpretable Machine Learning with Neural Nets",
  "collection": "Generalized Additive Models",
  "area": "General"
}
{
  "name": "I3DR-Net",
  "full_name": "Inflated 3D ConvNet Retina Net",
  "description": "",
  "title": "RetinaNet Object Detector based on Analog-to-Spiking Neural Network Conversion",
  "collection": "3D Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Soft Actor Critic",
  "full_name": "Soft Actor Critic",
  "description": "**Soft Actor Critic**, or **SAC**, is an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as [Q-learning methods](https://paperswithcode.com/method/q-learning). [SAC](https://paperswithcode.com/method/sac) combines off-policy updates with a stable stochastic actor-critic formulation.\r\n\r\nThe SAC objective has a number of advantages. First, the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. Second, the policy can capture multiple modes of near-optimal behavior. In problem settings where multiple actions seem equally attractive, the policy will commit equal probability mass to those actions. Lastly, the authors present evidence that it improves learning speed over state-of-art methods that optimize the conventional RL objective function.",
  "title": "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "FastPitch",
  "full_name": "FastPitch",
  "description": "**FastPitch** is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The architecture of FastPitch is shown in the Figure. It is based on FastSpeech and composed mainly of two feed-forward [Transformer](https://paperswithcode.com/method/transformer) (FFTr) stacks. The first one operates in the resolution of input tokens, the second one in the resolution of the output frames. Let $x=\\left(x\\_{1}, \\ldots, x\\_{n}\\right)$ be the sequence of input lexical units, and $\\mathbf{y}=\\left(y\\_{1}, \\ldots, y\\_{t}\\right)$ be the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $\\mathbf{h}=\\operatorname{FFTr}(\\mathbf{x})$. The hidden representation $h$ is used to make predictions about the duration and average pitch of every character with a 1-D CNN \r\n\r\n$$\r\n\\hat{\\mathbf{d}}=\\text { DurationPredictor }(\\mathbf{h}), \\quad \\hat{\\mathbf{p}}=\\operatorname{PitchPredictor}(\\mathbf{h})\r\n$$\r\n\r\nwhere $\\hat{\\mathbf{d}} \\in \\mathbb{N}^{n}$ and $\\hat{\\mathbf{p}} \\in \\mathbb{R}^{n}$. Next, the pitch is projected to match the dimensionality of the hidden representation $h \\in$ $\\mathbb{R}^{n \\times d}$ and added to $\\mathbf{h}$. The resulting sum $\\mathbf{g}$ is discretely upsampled and passed to the output FFTr, which produces the output mel-spectrogram sequence\r\n\r\n$$\r\n\\mathbf{g}=\\mathbf{h}+\\operatorname{PitchEmbedding}(\\mathbf{p})\r\n$$\r\n\r\n$$\r\n\\hat{\\mathbf{y}}=\\operatorname{FFTr}\\left([\\underbrace{g\\_{1}, \\ldots, g\\_{1}}\\_{d\\_{1}}, \\ldots \\underbrace{g\\_{n}, \\ldots, g\\_{n}}_{d\\_{n}}]\\right)\r\n$$\r\n\r\n\r\nGround truth $\\mathbf{p}$ and $\\mathbf{d}$ are used during training, and predicted $\\hat{\\mathbf{p}}$ and $\\hat{\\mathbf{d}}$ are used during inference. 
The model optimizes mean-squared error (MSE) between the predicted and ground-truth modalities\r\n\r\n$$\r\n\\mathcal{L}=\\|\\hat{\\mathbf{y}}-\\mathbf{y}\\|\\_{2}^{2}+\\alpha\\|\\hat{\\mathbf{p}}-\\mathbf{p}\\|\\_{2}^{2}+\\gamma\\|\\hat{\\mathbf{d}}-\\mathbf{d}\\|\\_{2}^{2}\r\n$$",
  "title": "FastPitch: Parallel Text-to-speech with Pitch Prediction",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "Quick Attention",
  "full_name": "Quick Attention",
  "description": "\\begin{equation}\r\nQA\\left( x \\right) = \\sigma\\left( f\\left( x \\right)^{1x1} \\right) + x \\end{equation}\r\n\r\nQuick Attention takes in the feature map as an input WxHxC (Width x Height x Channels) and creates two instances of the input feature map then it performs the 1x1xC convolution on the first instance and calculates the sigmoid activations after that it is added with the second instance to generate the final attention map as output which is of same dimensions as of input.",
  "title": "HistoSeg : Quick attention with multi-loss function for multi-structure segmentation in digital histology images",
  "collection": "Attention",
  "area": "General"
}
{
  "name": "PIRL",
  "full_name": "PIRL",
  "description": "**Pretext-Invariant Representation Learning (PIRL, pronounced as “pearl”)** learns invariant representations based on pretext tasks. PIRL is used with a commonly used pretext task that involves solving [jigsaw](https://paperswithcode.com/method/jigsaw) puzzles. Specifically, PIRL constructs image representations that are similar to the representation of transformed versions of the same image and different from the representations of other images.",
  "title": "Self-Supervised Learning of Pretext-Invariant Representations",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Sparsemax",
  "full_name": "Sparsemax",
  "description": "**Sparsemax** is a type of activation/output function similar to the traditional [softmax](https://paperswithcode.com/method/softmax), but able to output sparse probabilities. \r\n\r\n$$ \\text{sparsemax}\\left(z\\right) = \\arg\\_{p∈\\Delta^{K−1}}\\min||\\mathbf{p} - \\mathbf{z}||^{2} $$",
  "title": "From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification",
  "collection": "Output Functions",
  "area": "General"
}
{
  "name": "Drafting Network",
  "full_name": "Drafting Network",
  "description": "**Drafting Network** is a style transfer module designed to transfer global style patterns in low-resolution, since global patterns can be transferred easier in low resolution due to larger receptive field and less local details. To achieve single style transfer, earlier work trained an encoder-decoder module, where only the content image is used as input. To better combine the style feature and the content feature, the Drafting Network adopts the [AdaIN module](https://paperswithcode.com/method/adaptive-instance-normalization).\r\n\r\nThe architecture of Drafting Network is shown in the Figure, which includes an encoder, several AdaIN modules and a decoder. (1) The encoder is a pre-trained [VGG](https://paperswithcode.com/method/vgg)-19 network, which is fixed during training. Given $\\bar{x}\\_{c}$ and $\\bar{x}\\_{s}$, the VGG encoder extracts features in multiple granularity at 2_1, 3_1 and 4_1 layers. (2) Then, we apply feature modulation between the content and style feature using AdaIN modules after 2_1, 3_1 and 4_1 layers, respectively. (3) Finally, in each granularity of decoder, the corresponding feature from the AdaIN module is merged via a [skip-connection](https://paperswithcode.com/methods/category/skip-connections). Here, skip-connections after AdaIN modules in both low and high levels are leveraged to help to reserve content structure, especially for low-resolution image.",
  "title": "Drafting and Revision: Laplacian Pyramid Network for Fast High-Quality Artistic Style Transfer",
  "collection": "Style Transfer Modules",
  "area": "Computer Vision"
}
{
  "name": "AdaMax",
  "full_name": "AdaMax",
  "description": "**AdaMax** is a generalisation of [Adam](https://paperswithcode.com/method/adam) from the $l\\_{2}$ norm to the $l\\_{\\infty}$ norm. Define:\r\n\r\n$$ u\\_{t} = \\beta^{\\infty}\\_{2}v\\_{t-1} + \\left(1-\\beta^{\\infty}\\_{2}\\right)|g\\_{t}|^{\\infty}$$\r\n\r\n$$ = \\max\\left(\\beta\\_{2}\\cdot{v}\\_{t-1}, |g\\_{t}|\\right)$$\r\n\r\nWe can plug into the Adam update equation by replacing $\\sqrt{\\hat{v}_{t} + \\epsilon}$ with $u\\_{t}$ to obtain the AdaMax update rule:\r\n\r\n$$ \\theta\\_{t+1} = \\theta\\_{t} - \\frac{\\eta}{u\\_{t}}\\hat{m}\\_{t} $$\r\n\r\nCommon default values are $\\eta = 0.002$ and $\\beta\\_{1}=0.9$ and $\\beta\\_{2}=0.999$.",
  "title": "Adam: A Method for Stochastic Optimization",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "PWIL",
  "full_name": "Primal Wasserstein Imitation Learning",
  "description": "**Primal Wasserstein Imitation Learning**, or **PWIL**, is a method for imitation learning which ties to the primal form of the Wasserstein distance between the expert and the agent state-action distributions. The reward function is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and requires little fine-tuning.",
  "title": "Primal Wasserstein Imitation Learning",
  "collection": "Imitation Learning Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "SCA-CNN",
  "full_name": "Spatial and Channel-wise Attention-based Convolutional Neural Network",
  "description": "As CNN features are naturally spatial, channel-wise and multi-layer, \r\nChen et al. proposed a novel spatial and channel-wise attention-based convolutional neural network (SCA-CNN). \r\nIt was designed for the task of image captioning, and uses an encoder-decoder framework where a CNN first encodes an input image into a vector and then an LSTM decodes the vector into a sequence of words. Given an input feature map $X$ and the previous time step LSTM hidden state $h_{t-1} \\in \\mathbb{R}^d$, a spatial attention mechanism pays more attention to the semantically useful regions, guided by LSTM hidden state $h_{t-1}$. The  spatial attention model is:\r\n\r\n\\begin{align}\r\na(h_{t-1}, X) &= \\tanh(Conv_1^{1 \\times 1}(X) \\oplus W_1 h_{t-1})\r\n\\end{align}\r\n\r\n\\begin{align}\r\n\\Phi_s(h_{t-1}, X) &= \\text{Softmax}(Conv_2^{1 \\times 1}(a(h_{t-1}, X)))    \r\n\\end{align}\r\n\r\nwhere $\\oplus$ represents  addition of a matrix and a vector. Similarly, channel-wise attention aggregates global information first, and then computes a channel-wise attention weight vector with the hidden state $h_{t-1}$:\r\n\\begin{align}\r\nb(h_{t-1}, X) &= \\tanh((W_2\\text{GAP}(X)+b_2)\\oplus W_1h_{t-1})\r\n\\end{align}\r\n\\begin{align}\r\n\\Phi_c(h_{t-1}, X) &= \\text{Softmax}(W_3(b(h_{t-1}, X))+b_3)    \r\n\\end{align}\r\nOverall, the  SCA mechanism can be written in one of two ways. 
If channel-wise attention is applied before spatial attention, we have\r\n\\begin{align}\r\nY &= f(X,\\Phi_s(h_{t-1}, X \\Phi_c(h_{t-1}, X)), \\Phi_c(h_{t-1}, X)) \r\n\\end{align}\r\nand  if spatial attention comes first:\r\n\\begin{align}\r\nY &= f(X,\\Phi_s(h_{t-1}, X), \\Phi_c(h_{t-1}, X \\Phi_s(h_{t-1}, X)))\r\n\\end{align}\r\nwhere $f(\\cdot)$ denotes the modulate function which takes the feature map $X$ and attention maps as input and then outputs the modulated feature map $Y$.\r\n\r\nUnlike previous attention mechanisms which consider each image region equally and use global spatial information to tell the network where to focus, SCA-Net leverages the semantic vector to produce the spatial attention map as well as the channel-wise attention weight vector. Being more than a powerful attention model, SCA-CNN also provides a better understanding of where and what the model should focus on during sentence generation.",
  "title": "SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Spectral Dropout",
  "full_name": "Spectral Dropout",
  "description": "Please enter a description about the method here",
  "title": "Regularization of Deep Neural Networks with Spectral Dropout",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "DeltaConv",
  "full_name": "DeltaConv",
  "description": "Anisotropic convolution is a central building block of CNNs but challenging to transfer to surfaces. DeltaConv learns combinations and compositions of operators from vector calculus, which are a natural fit for curved surfaces. The result is a simple and robust anisotropic convolution operator for point clouds with state-of-the-art results.",
  "title": "DeltaConv: Anisotropic Operators for Geometric Deep Learning on Point Clouds",
  "collection": "3D Representations",
  "area": "Computer Vision"
}
{
  "name": "Axial Attention",
  "full_name": "Axial Attention",
  "description": "**Axial Attention** is a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings. It was first proposed in [CCNet](https://paperswithcode.com/method/ccnet) [1] named as criss-cross attention, which harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Ho et al [2] extents CCNet to process multi-dimensional data.  The proposed structure of the layers allows for the vast majority of the context to be computed in parallel during decoding without introducing any independence assumptions. It serves as the basic building block for developing self-attention-based autoregressive models for high-dimensional data tensors, e.g., Axial Transformers. It has been applied in [AlphaFold](https://paperswithcode.com/method/alphafold) [3] for interpreting protein sequences.\r\n\r\n[1] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, Wenyu Liu. CCNet: Criss-Cross Attention for Semantic Segmentation. ICCV, 2019.\r\n\r\n[2] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans. arXiv:1912.12180\r\n\r\n[3] Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Jul 15:1-1.",
  "title": "Axial Attention in Multidimensional Transformers",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "AdaBound",
  "full_name": "AdaBound",
  "description": "**AdaBound** is a variant of the [Adam](https://paperswithcode.com/method/adabound) stochastic optimizer which is designed to be more robust to extreme learning rates. Dynamic bounds are employed on learning rates, where the lower and upper bound are initialized as zero and infinity respectively, and they both smoothly converge to a constant final step size. AdaBound can be regarded as an adaptive method at the beginning of training, and thereafter it gradually and smoothly transforms to [SGD](https://paperswithcode.com/method/sgd) (or with momentum) as the time step increases. \r\n\r\n$$ g\\_{t} = \\nabla{f}\\_{t}\\left(x\\_{t}\\right) $$\r\n\r\n$$ m\\_{t} = \\beta\\_{1t}m\\_{t-1} + \\left(1-\\beta\\_{1t}\\right)g\\_{t} $$\r\n\r\n$$ v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2} \\text{ and } V\\_{t} = \\text{diag}\\left(v\\_{t}\\right) $$\r\n\r\n$$ \\hat{\\eta}\\_{t} = \\text{Clip}\\left(\\alpha/\\sqrt{V\\_{t}}, \\eta\\_{l}\\left(t\\right), \\eta\\_{u}\\left(t\\right)\\right) \\text{ and } \\eta\\_{t} = \\hat{\\eta}\\_{t}/\\sqrt{t} $$\r\n\r\n$$ x\\_{t+1} = \\Pi\\_{\\mathcal{F}, \\text{diag}\\left(\\eta\\_{t}^{-1}\\right)}\\left(x\\_{t} - \\eta\\_{t} \\odot m\\_{t} \\right) $$\r\n\r\nWhere $\\alpha$ is the initial step size, and $\\eta_{l}$ and $\\eta_{u}$ are the lower and upper bound functions respectively.",
  "title": "Adaptive Gradient Methods with Dynamic Bound of Learning Rate",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "ISPL",
  "full_name": "Implicit Subspace Prior Learning",
  "description": "**Implicit Subspace Prior Learning**, or **ISPL**, is a framework to approach dual-blind face restoration, with two major distinctions from previous restoration methods: 1) Instead of assuming an explicit degradation function between LQ and HQ domain, it establishes an implicit correspondence between both domains via a mutual embedding space, thus avoid solving the pathological inverse problem directly. 2) A subspace prior decomposition and fusion mechanism to dynamically handle inputs at varying degradation levels with consistent high-quality restoration results.",
  "title": "Implicit Subspace Prior Learning for Dual-Blind Face Restoration",
  "collection": "Face Restoration Models",
  "area": "Computer Vision"
}
{
  "name": "Recurrent Dropout",
  "full_name": "Recurrent Dropout",
  "description": "**Recurrent Dropout** is a regularization method for [recurrent neural networks](https://paperswithcode.com/methods/category/recurrent-neural-networks). [Dropout](https://paperswithcode.com/method/dropout) is applied to the updates to [LSTM](https://paperswithcode.com/method/lstm) memory cells (or [GRU](https://paperswithcode.com/method/gru) states), i.e. it drops out the input/update gate in LSTM/GRU.",
  "title": "Recurrent Dropout without Memory Loss",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "ShuffleNet V2 Block",
  "full_name": "ShuffleNet V2 Block",
  "description": "**ShuffleNet V2 Block** is an image model block used in the [ShuffleNet V2](https://paperswithcode.com/method/shufflenet-v2) architecture, where speed is the metric optimized for (instead of indirect ones like FLOPs). It utilizes a simple operator called channel split. At the beginning of each unit, the input of $c$ feature channels are split into two branches with $c - c'$ and $c'$ channels, respectively. Following **G3**, one branch remains as identity. The other branch consists of three convolutions with the same input and output channels to satisfy **G1**. The two $1\\times1$ convolutions are no longer group-wise, unlike the original [ShuffleNet](https://paperswithcode.com/method/shufflenet). This is partially to follow **G2**, and partially because the split operation already produces two groups. After [convolution](https://paperswithcode.com/method/convolution), the two branches are concatenated. So, the number of channels keeps the same (G1). The same “[channel shuffle](https://paperswithcode.com/method/channel-shuffle)” operation as in ShuffleNet is then used to enable information communication between the two branches.\r\n\r\nThe motivation behind channel split is that alternative architectures, where pointwise group convolutions and bottleneck structures are used, lead to increased memory access cost. Additionally more network fragmentation with group convolutions reduces parallelism (less friendly for GPU), and the element-wise addition operation, while they have low FLOPs, have high memory access cost. Channel split is an alternative where we can maintain a large number of equally wide channels (equally wide minimizes memory access cost) without having dense convolutions or too many groups.",
  "title": "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "NCL",
  "full_name": "Neighborhood Contrastive Learning",
  "description": "",
  "title": "Neighborhood Contrastive Learning Applied to Online Patient Monitoring",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "TSRUs",
  "full_name": "TSRUs",
  "description": "**TSRUs**, or **Transformation-based Spatial Recurrent Unit p**, is a modification of a [ConvGRU](https://paperswithcode.com/method/cgru) used in the [TriVD-GAN](https://paperswithcode.com/method/trivd-gan) architecture for video generation.\r\n\r\nIt largely follows [TSRUc](https://paperswithcode.com/method/tsruc), but computes each intermediate output in a fully sequential manner: like in TSRUc, $c$ is given access to $\\hat{h}\\_{t-1}$, but additionally, $u$ is given access to both outputs $\\hat{h}\\_{t-1}$ and $c$, so as to make an informed decision prior to mixing. This yields the following replacement for $u$:\r\n\r\n$$ u = \\sigma\\left(W\\_{u} \\star\\_{n}\\left[\\hat{h}\\_{t-1};c\\right] + b\\_{u} \\right) $$\r\n\r\nIn these equations $\\sigma$ and $\\rho$ are the elementwise sigmoid and [ReLU](https://paperswithcode.com/method/relu) functions respectively and the $\\star\\_{n}$ represents a [convolution](https://paperswithcode.com/method/convolution) with a kernel of size $n \\times n$. Brackets are used to represent a feature concatenation.",
  "title": "Transformation-based Adversarial Video Prediction on Large-Scale Data",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Dual Softmax Loss",
  "full_name": "Dual Softmax Loss",
  "description": "**Dual Softmax Loss** is a loss function based on symmetric cross-entropy loss used in the [CAMoE](https://paperswithcode.com/method/camoe) video-text retrieval model. Every text and video are calculated the\r\nsimilarity with other videos or texts, which should be maximum in terms of the ground truth pair. For DSL, a prior is introduced to revise the similarity score. Multiplying the prior with the original similarity matrix imposes an efficient constraint and can help to filter those single side match pairs. As a result, DSL highlights the one with both great Text-to-Video and Video-to-Text probability, conducting a more convincing result.",
  "title": "Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "CutMix",
  "full_name": "CutMix",
  "description": "**CutMix** is an image data augmentation strategy. Instead of simply removing pixels as in [Cutout](https://paperswithcode.com/method/cutout), we replace the removed regions with a patch from another image. The ground truth labels are also mixed proportionally to the number of pixels of combined images. The added patches further enhance localization ability by requiring the model to identify the object from a partial view.",
  "title": "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "FEFM",
  "full_name": "Field Embedded Factorization Machine",
  "description": "**Field Embedded Factorization Machine**, or **FEFM**, is a factorization machine variant. For each field pair, FEFM introduces symmetric matrix embeddings along with the usual feature vector embeddings that are present in FM. Like FM, $v\\_{i}$ is the vector embedding of the $i^{t h}$ feature. However, unlike Field-Aware Factorization Machines (FFMs), FEFM doesn't explicitly learn field-specific feature embeddings. The learnable symmetric matrix $W\\_{F(i), F(j)}$ is the embedding for the field pair $F(i)$ and $F(j) .$ The interaction between the $i^{t h}$ feature and the $j^{t h}$ feature is mediated through $W_{F(i), F(j)} .$\r\n\r\n$$\r\n\\phi(\\theta, x)=\\phi\\_{F E F M}((w, v, W), x)=w\\_{0}+\\sum\\_{i=1}^{m} w_{i} x_{i}+\\sum\\_{i=1}^{m} \\sum\\_{j=i+1}^{m} v\\_{i}^{T} W\\_{F(i), F(j)} v\\_{j} x\\_{i} x\\_{j}\r\n$$\r\n\r\nwhere $W\\_{F(i), F(j)}$ is a $k \\times k$ symmetric matrix ( $k$ is the dimension of the feature vector embedding space containing feature vectors $v\\_{i}$ and $v\\_{j}$ ).\r\n\r\nThe symmetric property of the learnable matrix $W\\_{F(i), F(j)}$ is ensured by reparameterizing $W\\_{F(i), F(j)}$ as $U\\_{F(i), F(j)}+$ $U\\_{F(i), F(j)}^{T}$, where $U\\_{F(i), F(j)}^{T}$ is the transpose of the learnable matrix $U\\_{F(i), F(j)} .$ Note that $W_{F(i), F(j)}$ can also be interpreted as a vector transformation matrix which transforms a feature embedding when interacting with a specific field.",
  "title": "Field-Embedded Factorization Machines for Click-through rate prediction",
  "collection": "Factorization Machines",
  "area": "General"
}
{
  "name": "MCKERNEL",
  "full_name": "MCKERNEL",
  "description": "McKernel introduces a framework to use kernel approximates in the mini-batch setting with Stochastic Gradient Descent ([SGD](https://paperswithcode.com/method/sgd)) as an alternative to Deep Learning.\r\n\r\nThe core library was developed in 2014 as integral part of a thesis of Master of Science [1,2] at Carnegie Mellon and City University of Hong Kong. The original intend was to implement a speedup of Random Kitchen Sinks (Rahimi and Recht 2007) by writing a very efficient HADAMARD tranform, which was the main bottleneck of the construction. The code though was later expanded at ETH Zürich (in McKernel by Curtó et al. 2017) to propose a framework that could explain both Kernel Methods and Neural Networks. This manuscript and the corresponding theses, constitute one of the first usages (if not the first) in the literature of FOURIER features and Deep Learning; which later got a lot of research traction and interest in the community.\r\n\r\nMore information can be found in this presentation that the first author gave at ICLR 2020 [iclr2020_DeCurto](https://www.decurto.tw/c/iclr2020_DeCurto.pdf).\r\n\r\n[1] [https://www.curto.hk/c/decurto.pdf](https://www.curto.hk/c/decurto.pdf)\r\n\r\n[2] [https://www.zarza.hk/z/dezarza.pdf](https://www.zarza.hk/z/dezarza.pdf)",
  "title": "McKernel: A Library for Approximate Kernel Expansions in Log-linear Time",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "AWD-LSTM",
  "full_name": "ASGD Weight-Dropped LSTM",
  "description": "**ASGD Weight-Dropped LSTM**, or **AWD-LSTM**, is a type of recurrent neural network that employs [DropConnect](https://paperswithcode.com/method/dropconnect) for regularization, as well as [NT-ASGD](https://paperswithcode.com/method/nt-asgd) for optimization - non-monotonically triggered averaged [SGD](https://paperswithcode.com/method/sgd) - which returns an average of last iterations of weights. Additional regularization techniques employed include variable length backpropagation sequences, [variational dropout](https://paperswithcode.com/method/variational-dropout), [embedding dropout](https://paperswithcode.com/method/embedding-dropout), [weight tying](https://paperswithcode.com/method/weight-tying), independent embedding/hidden size, [activation regularization](https://paperswithcode.com/method/activation-regularization) and [temporal activation regularization](https://paperswithcode.com/method/temporal-activation-regularization).",
  "title": "Regularizing and Optimizing LSTM Language Models",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Siamese Network",
  "full_name": "Siamese Network",
  "description": "A **Siamese Network** consists of twin networks which accept distinct inputs but are joined by an energy function at the top. This function computes a metric between the highest level feature representation on each side. The parameters between the twin networks are tied. [Weight tying](https://paperswithcode.com/method/weight-tying) guarantees that two extremely similar images are not mapped by each network to very different locations in feature space because each network computes the same function. The network is symmetric, so that whenever we present two distinct images to the twin networks, the top conjoining layer will compute the same metric as if we were to we present the same two images but to the opposite twins.\r\n\r\nIntuitively instead of trying to classify inputs, a siamese network learns to differentiate between inputs, learning their similarity. The loss function used is usually a form of contrastive loss.\r\n\r\nSource: [Koch et al](https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf)",
  "title": null,
  "collection": "Twin Networks",
  "area": "General"
}
{
  "name": "HMGNN",
  "full_name": "Heterogeneous Molecular Graph Neural Network",
  "description": "As they carry great potential for modeling complex interactions, graph neural network (GNN)-based methods have been widely used to predict quantum mechanical properties of molecules. Most of the existing methods treat molecules as molecular graphs in which atoms are modeled as nodes. They characterize each atom's chemical environment by modeling its pairwise interactions with other atoms in the molecule. Although these methods achieve a great success, limited amount of works explicitly take many-body interactions, i.e., interactions between three and more atoms, into consideration. In this paper, we introduce a novel graph representation of molecules, heterogeneous molecular graph (HMG) in which nodes and edges are of various types, to model many-body interactions. HMGs have the potential to carry complex geometric information. To leverage the rich information stored in HMGs for chemical prediction problems, we build heterogeneous molecular graph neural networks (HMGNN) on the basis of a neural message passing scheme. HMGNN incorporates global molecule representations and an attention mechanism into the prediction process. The predictions of HMGNN are invariant to translation and rotation of atom coordinates, and permutation of atom indices. Our model achieves state-of-the-art performance in 9 out of 12 tasks on the QM9 dataset.",
  "title": "Heterogeneous Molecular Graph Neural Networks for Predicting Molecule Properties",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "SNGAN",
  "full_name": "Spectrally Normalised GAN",
  "description": "**SNGAN**, or **Spectrally Normalised GAN**, is a type of generative adversarial network that uses [spectral normalization](https://paperswithcode.com/method/spectral-normalization), a type of [weight normalization](https://paperswithcode.com/method/weight-normalization), to stabilise the training of the discriminator.",
  "title": "Spectral Normalization for Generative Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Adaptive Loss",
  "full_name": "Adaptive Robust Loss",
  "description": "The Robust Loss is a generalization of the Cauchy/Lorentzian, Geman-McClure, Welsch/Leclerc, generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 loss functions. By introducing robustness as a continuous parameter, the loss function allows algorithms built around robust loss minimization to be generalized, which improves performance on basic vision tasks such as registration and clustering. Interpreting the loss as the negative log of a univariate density yields a general probability distribution that includes normal and Cauchy distributions as special cases. This probabilistic interpretation enables the training of neural networks in which the robustness of the loss automatically adapts itself during training, which improves performance on learning-based tasks such as generative image synthesis and unsupervised monocular depth estimation, without requiring any manual parameter tuning.",
  "title": "A General and Adaptive Robust Loss Function",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "ResNeSt",
  "full_name": "ResNeSt",
  "description": "A **ResNest** is a variant on a [ResNet](https://paperswithcode.com/method/resnet), which instead stacks Split-Attention blocks. The cardinal group representations are then concatenated along the channel dimension: $V = \\text{Concat}${$V^{1},V^{2},\\cdots{V}^{K}$}. As in standard residual blocks, the final output $Y$ of otheur Split-Attention block is produced using a shortcut connection: $Y=V+X$, if the input and output feature-map share the same shape.  For blocks with a stride, an appropriate transformation $\\mathcal{T}$ is applied to the shortcut connection to align the output shapes:  $Y=V+\\mathcal{T}(X)$. For example, $\\mathcal{T}$ can be strided [convolution](https://paperswithcode.com/method/convolution) or combined convolution-with-pooling.",
  "title": "ResNeSt: Split-Attention Networks",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "DVD-GAN DBlock",
  "full_name": "DVD-GAN DBlock",
  "description": "**DVD-GAN DBlock** is a residual block for the discriminator used in the [DVD-GAN](https://paperswithcode.com/method/dvd-gan) architecture for video generation. Unlike regular [residual blocks](https://paperswithcode.com/method/residual-block), [3D convolutions](https://paperswithcode.com/method/3d-convolution) are employed due to the application to multiple frames in a video.",
  "title": "Adversarial Video Generation on Complex Datasets",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "CosLU",
  "full_name": "Cosine Linear Unit",
  "description": "The **Cosine Linear Unit**, or **CosLU**, is a type of activation function that has trainable parameters and uses the cosine function.\r\n\r\n$$CosLU(x) = (x + \\alpha \\cos(\\beta x))\\sigma(x)$$",
  "title": "Trainable Activations for Image Classification",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "FLAVA",
  "full_name": "FLAVA",
  "description": "FLAVA aims at building a single holistic universal model that targets all modalities at once. FLAVA is a language vision alignment model that learns strong representations from multimodal data (image-text pairs) and unimodal data (unpaired images and text). The model consists of an image encode transformer to capture unimodal image representations, a text encoder transformer to process unimodal text information, and a multimodal encode transformer that takes as input the encoded unimodal image and text and integrates their representations for multimodal reasoning. During pretraining, masked image modeling (MIM) and mask language modeling (MLM) losses are applied onto the image and text encoders over a single image or a text piece, respectively, while contrastive, masked multimodal modeling (MMM), and image-text matching (ITM) loss are used over paired image-text data. For downstream tasks, classification heads are applied on the outputs from the image, text, and multimodal encoders respectively for visual recognition, language understanding, and multimodal reasoning tasks It can be applied to broad scope of tasks from three domains (visual recognition, language understanding, and multimodal reasoning) under a common transformer model architecture.",
  "title": "FLAVA: A Foundational Language And Vision Alignment Model",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "ECANet",
  "full_name": "efficient channel attention",
  "description": "An ECA block has similar formulation to an SE block including a squeeze module for aggregating global spatial information and an efficient excitation module for modeling cross-channel interaction. Instead of indirect correspondence, an ECA block only considers direct interaction between each channel and its k-nearest neighbors to control model complexity. Overall, the formulation of an ECA block is:\r\n\\begin{align}\r\n    s = F_\\text{eca}(X, \\theta) & = \\sigma (\\text{Conv1D}(\\text{GAP}(X))) \r\n\\end{align}\r\n\\begin{align}\r\n    Y & = s  X\r\n\\end{align}\r\nwhere $\\text{Conv1D}(\\cdot)$ denotes 1D convolution with a kernel of shape $k$ across the channel domain, to model local cross-channel interaction. The parameter $k$ decides the coverage of interaction, and in ECA the kernel size $k$ is adaptively determined from the channel dimensionality $C$ instead of by manual tuning, using cross-validation:\r\n\\begin{equation}\r\n    k = \\psi(C) = \\left | \\frac{\\log_2(C)}{\\gamma}+\\frac{b}{\\gamma}\\right |_\\text{odd}\r\n\\end{equation}\r\n\r\nwhere $\\gamma$ and $b$ are hyperparameters. $|x|_\\text{odd}$ indicates the nearest odd function of $x$. \r\n\r\nCompared to SENet, ECANet has an \r\nimproved excitation module, and provides an efficient and effective block which can readily be \r\n incorporated into various\r\nCNNs.",
  "title": "ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "DeepViT",
  "full_name": "DeepViT",
  "description": "**DeepViT** is a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) that replaces the self-attention layer within the [transformer](https://paperswithcode.com/method/transformer) block with a [Re-attention module](https://paperswithcode.com/method/re-attention-module) to address the issue of attention collapse and enables training deeper ViTs.",
  "title": "DeepViT: Towards Deeper Vision Transformer",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "BatchChannel Normalization",
  "full_name": "BatchChannel Normalization",
  "description": "**Batch-Channel Normalization**, or **BCN**, uses batch knowledge to prevent channel-normalized models from getting too close to \"elimination singularities\". Elimination singularities correspond to the points on the training trajectory where neurons become consistently deactivated. They cause degenerate manifolds in the loss landscape which will slow down training and harm model performances.",
  "title": "Rethinking Normalization and Elimination Singularity in Neural Networks",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "SESAME Discriminator",
  "full_name": "SESAME Discriminator",
  "description": "Extends [PatchGAN](https://paperswithcode.com/method/patchgan) discriminator for the task of layout2image generation. The discriminator is comprised of two processing streams: one for the RGB image and one for its semantics, which are fused together at the later stages of the discriminator.",
  "title": "SESAME: Semantic Editing of Scenes by Adding, Manipulating or Erasing Objects",
  "collection": "Discriminators",
  "area": "General"
}
{
  "name": "MuZero",
  "full_name": "MuZero",
  "description": "**MuZero** is a model-based reinforcement learning algorithm. It builds upon [AlphaZero](https://paperswithcode.com/method/alphazero)'s search and search-based policy iteration algorithms, but incorporates a learned model into the training procedure. \r\n\r\nThe main idea of the algorithm is to predict those aspects of the future that are directly relevant for planning. The model receives the observation (e.g. an image of the Go board or the Atari screen) as an\r\ninput and transforms it into a hidden state. The hidden state is then updated iteratively by a recurrent process that receives the previous hidden state and a hypothetical next action. At every one of these steps the model predicts the policy (e.g. the move to play), value function (e.g. the predicted winner), and immediate reward (e.g. the points scored by playing a move). The model is trained end-to-end, with the sole objective of accurately estimating these three important quantities, so as to match the improved estimates of policy and value generated by search as well as the observed reward. \r\n\r\nThere is no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation, drastically reducing the amount of information the model has to maintain and predict; nor is there any requirement for the hidden state to match the unknown, true state of the environment; nor any other constraints on the semantics of state. Instead, the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to most accurate planning.",
  "title": "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model",
  "collection": "Board Game Models",
  "area": "Reinforcement Learning"
}
{
  "name": "HEGCN",
  "full_name": "Hierarchical Entity Graph Convolutional Network",
  "description": "**HEGCN**, or **Hierarchical Entity Graph Convolutional Network** is a model for multi-hop relation extraction across documents. Documents in a document chain are encoded using a bi-directional long short-term memory ([BiLSTM](https://paperswithcode.com/method/bilstm)) layer. On top of the BiLSTM layer, two graph convolutional networks ([GCN](https://paperswithcode.com/method/gcn)) are used, one after another in a hierarchy. \r\n\r\nIn the first level of the GCN hierarchy, a separate entity mention graph is constructed on each document of the chain using all the entities mentioned in that document. Each mention of an entity in a document is considered as a separate node in the graph. A graph convolutional network (GCN) is used to represent the entity mention graph of each document to capture the relations among the entity mentions in the document. A unified entity-level graph is then constructed across all the documents in the chain. Each node of this entity-level graph represents a unique entity in the document chain. Each common entity between two documents in the chain is represented by a single node in the graph. A GCN is used to represent this entity-level graph to capture the relations among the entities across the documents. \r\n\r\nThe representations of the nodes of the subject entity and object entity are concatenated and passed to a feed-forward layer with [softmax](https://paperswithcode.com/method/softmax) for relation classification.",
  "title": "A Hierarchical Entity Graph Convolutional Network for Relation Extraction across Documents",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Channel attention",
  "full_name": "squeeze-and-excitation networks",
  "description": "SENet pioneered channel attention. The core of SENet is a squeeze-and-excitation (SE) block which is used to collect global information, capture channel-wise relationships and improve representation ability.\r\nSE blocks are divided into two parts, a squeeze module and an excitation module. Global spatial information is collected in the squeeze module by global average pooling. The excitation module captures channel-wise relationships and outputs an attention vector by using fully-connected layers and non-linear layers (ReLU and sigmoid). Then, each channel of the input feature is scaled by multiplying the corresponding element in the attention vector. Overall, a squeeze-and-excitation block $F_\\text{se}$ (with parameter $\\theta$) which takes $X$ as input and outputs $Y$ can be formulated \r\nas:\r\n\\begin{align}\r\n    s = F_\\text{se}(X, \\theta) & = \\sigma (W_{2} \\delta (W_{1}\\text{GAP}(X)))\r\n\\end{align}\r\n\\begin{align}\r\n    Y = sX\r\n\\end{align}",
  "title": "Squeeze-and-Excitation Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ESPNet",
  "full_name": "ESPNet",
  "description": "**ESPNet** is a convolutional neural network for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a convolutional module, efficient spatial pyramid ([ESP](https://paperswithcode.com/method/esp)), which is efficient in terms of computation, memory, and power.",
  "title": "ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Wide&Deep",
  "full_name": "Wide&Deep",
  "description": "**Wide&Deep** jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for real-world recommender systems. In summary, the wide component is a generalized linear model. The deep component is a feed-forward neural network. The deep and wide components are combined using a weighted sum of their output log odds as the prediction. This is then fed to a logistic loss function for joint training, which is done by back-propagating the gradients from the output to both the wide and deep part of the model simultaneously using mini-batch stochastic optimization. The AdaGrad optimizer is used for the wider part. The combined model is illustrated in the figure (center).",
  "title": "Wide & Deep Learning for Recommender Systems",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "DNAS",
  "full_name": "Differentiable Neural Architecture Search",
  "description": "**DNAS**, or **Differentiable Neural Architecture Search**, uses gradient-based methods to optimize ConvNet architectures, avoiding enumerating and training individual architectures separately as in previous methods. DNAS allows us to explore a layer-wise search space where we can choose a different block for each layer of the network. DNAS represents the search space by a super net whose operators execute stochastically. It relaxes the problem of finding the optimal architecture to find a distribution that yields the optimal architecture. By using the [Gumbel Softmax](https://paperswithcode.com/method/gumbel-softmax) technique, it is possible to directly train the architecture distribution using gradient-based optimization such as [SGD](https://paperswithcode.com/method/sgd).\r\n\r\nThe loss used to train the stochastic super net consists of both the cross-entropy loss that leads to better accuracy and the latency loss that penalizes the network's latency on a target device. To estimate the latency of an architecture, the latency of each operator in the search space is measured and a lookup table model is used to compute the overall latency by adding up the latency of each operator. Using this model allows for estimation of the latency of architectures in an enormous search space. More importantly, it makes the latency differentiable with respect to layer-wise block choices.",
  "title": "FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "SmeLU",
  "full_name": "Smooth ReLU",
  "description": "Please enter a description about the method here",
  "title": null,
  "collection": "Recommendation Systems",
  "area": "General"
}
{
  "name": "InceptionTime",
  "full_name": "InceptionTime",
  "description": "",
  "title": "InceptionTime: Finding AlexNet for Time Series Classification",
  "collection": "Time Series Analysis",
  "area": "Sequential"
}
{
  "name": "HyperHyperNetwork",
  "full_name": "Hyper HyperNetwork",
  "description": "",
  "title": "HyperHyperNetworks for the Design of Antenna Arrays",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "TuckER",
  "full_name": "TuckER",
  "description": "TuckER",
  "title": "TuckER: Tensor Factorization for Knowledge Graph Completion",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "HyperNetwork",
  "full_name": "HyperNetwork",
  "description": "A **HyperNetwork** is a network that generates weights for a main network.  The behavior of the main network is the same with any usual neural network: it learns to map some raw inputs to their desired targets; whereas the hypernetwork takes a set of inputs that contain information about the structure of the weights and generates the weight for that layer.",
  "title": "HyperNetworks",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "Random Synthesized Attention",
  "full_name": "Random Synthesized Attention",
  "description": "**Random Synthesized Attention** is a form of synthesized attention where the attention weights are not conditioned on any input tokens. Instead, the attention weights are initialized to random values. It was introduced with the [Synthesizer](https://paperswithcode.com/method/synthesizer) architecture. Random Synthesized Attention contrasts with [Dense Synthesized Attention](https://paperswithcode.com/method/dense-synthesized-attention) which conditions on each token independently, as opposed to pairwise token interactions in the vanilla [Transformer](https://paperswithcode.com/method/transformer) model.\r\n\r\nLet $R$ be a randomly initialized matrix. Random Synthesized Attention is defined as:\r\n\r\n$$Y = \\text{Softmax}\\left(R\\right)G\\left(X\\right) $$\r\n\r\nwhere $R \\in \\mathbb{R}^{l \\text{ x } l}$. Notably, each head adds 2 parameters to the overall network. The basic idea of the Random Synthesizer is to not rely on pairwise token interactions or any information from individual token but rather to learn a task-specific alignment that works well globally across many samples. This is a direct generalization of the recently proposed fixed self-attention patterns of [Raganato et al (2020)](https://arxiv.org/abs/2002.10260).",
  "title": "Synthesizer: Rethinking Self-Attention in Transformer Models",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Adam",
  "full_name": "Adam",
  "description": "**Adam** is an adaptive learning rate optimization algorithm that utilises both momentum and scaling, combining the benefits of [RMSProp](https://paperswithcode.com/method/rmsprop) and [SGD w/th Momentum](https://paperswithcode.com/method/sgd-with-momentum). The optimizer is designed to be appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. \r\n\r\nThe weight updates are performed as:\r\n\r\n$$ w_{t} = w_{t-1} - \\eta\\frac{\\hat{m}\\_{t}}{\\sqrt{\\hat{v}\\_{t}} + \\epsilon}  $$\r\n\r\nwith\r\n\r\n$$ \\hat{m}\\_{t} = \\frac{m_{t}}{1-\\beta^{t}_{1}} $$\r\n\r\n$$ \\hat{v}\\_{t} = \\frac{v_{t}}{1-\\beta^{t}_{2}} $$\r\n\r\n$$ m_{t} = \\beta_{1}m_{t-1} + (1-\\beta_{1})g_{t} $$\r\n\r\n$$ v_{t} = \\beta_{2}v_{t-1} + (1-\\beta_{2})g_{t}^{2}  $$\r\n\r\n\r\n$ \\eta $ is the step size/learning rate, around 1e-3 in the original paper. $ \\epsilon $ is a small number, typically 1e-8 or 1e-10, to prevent dividing by zero. $ \\beta_{1} $ and $ \\beta_{2} $ are forgetting parameters, with typical values 0.9 and 0.999, respectively.",
  "title": "Adam: A Method for Stochastic Optimization",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "DeBERTa",
  "full_name": "DeBERTa",
  "description": "**DeBERTa** is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based neural language model that aims to improve the [BERT](https://paperswithcode.com/method/bert) and [RoBERTa](https://paperswithcode.com/method/roberta) models with two techniques: a [disentangled attention mechanism](https://paperswithcode.com/method/disentangled-attention-mechanism) and an enhanced mask decoder. The disentangled attention mechanism is where each word is represented unchanged using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangle matrices on their contents and relative positions. The enhanced mask decoder is used to replace the output [softmax](https://paperswithcode.com/method/softmax) layer to predict the masked tokens for model pre-training.  In addition, a new virtual adversarial training method is used for fine-tuning to improve model’s generalization on downstream tasks.",
  "title": "DeBERTa: Decoding-enhanced BERT with Disentangled Attention",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "KOVA",
  "full_name": "Kalman Optimization for Value Approximation",
  "description": "**Kalman Optimization for Value Approximation**, or **KOVA** is a general framework for addressing uncertainties while approximating value-based functions in deep RL domains. KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties. It is feasible when using non-linear approximation functions as DNNs and can estimate the value in both on-policy and off-policy settings. It can be incorporated as a policy evaluation component in policy optimization algorithms.",
  "title": "Kalman meets Bellman: Improving Policy Evaluation through Value Tracking",
  "collection": "Policy Evaluation",
  "area": "Reinforcement Learning"
}
{
  "name": "Highway Network",
  "full_name": "Highway Network",
  "description": "A **Highway Network** is an architecture designed to ease gradient-based training of very deep networks. They allow unimpeded information flow across several layers on \"information highways\". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions.",
  "title": "Highway Networks",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "Deformable DETR",
  "full_name": "Deformable DETR",
  "description": "**Deformable DETR** is an object detection method that aims mitigates the slow convergence and high complexity issues of [DETR](https://www.paperswithcode.com/method/detr). It combines the best of the sparse spatial sampling of [deformable convolution](https://paperswithcode.com/method/deformable-convolution), and the relation modeling capability of [Transformers](https://paperswithcode.com/methods/category/transformers). Specifically, it introduces a \r\n deformable attention module, which attends to a small set of sampling locations as a pre-filter for prominent key elements out of all the feature map pixels. The module can be naturally extended to aggregating multi-scale features, without the help of [FPN](https://paperswithcode.com/method/fpn).",
  "title": "Deformable DETR: Deformable Transformers for End-to-End Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Huber loss",
  "full_name": "Huber loss",
  "description": "The Huber loss function describes the penalty incurred by an estimation procedure f. Huber (1964) defines the loss function piecewise by[1]\r\n\r\n    L δ ( a ) = { 1 2 a 2 for  | a | ≤ δ , δ ⋅ ( | a | − 1 2 δ ) , otherwise. {\\displaystyle L_{\\delta }(a)={\\begin{cases}{\\frac {1}{2}}{a^{2}}&{\\text{for }}|a|\\leq \\delta ,\\\\\\delta \\cdot \\left(|a|-{\\frac {1}{2}}\\delta \\right),&{\\text{otherwise.}}\\end{cases}}}\r\n\r\nThis function is quadratic for small values of a, and linear for large values, with equal values and slopes of the different sections at the two points where | a | = δ |a|=\\delta . The variable a often refers to the residuals, that is to the difference between the observed and predicted values a = y − f ( x ) a=y-f(x), so the former can be expanded to[2]\r\n\r\n    L δ ( y , f ( x ) ) = { 1 2 ( y − f ( x ) ) 2 for  | y − f ( x ) | ≤ δ , δ   ⋅ ( | y − f ( x ) | − 1 2 δ ) , otherwise. {\\displaystyle L_{\\delta }(y,f(x))={\\begin{cases}{\\frac {1}{2}}(y-f(x))^{2}&{\\text{for }}|y-f(x)|\\leq \\delta ,\\\\\\delta \\ \\cdot \\left(|y-f(x)|-{\\frac {1}{2}}\\delta \\right),&{\\text{otherwise.}}\\end{cases}}}\r\n\r\nThe Huber loss is the convolution of the absolute value function with the rectangular function, scaled and translated. Thus it \"smoothens out\" the former's corner at the origin. \r\n\r\n.. math::\r\n        \\ell(x, y) = L = \\{l_1, ..., l_N\\}^T\r\n\r\n    with\r\n\r\n    .. math::\r\n        l_n = \\begin{cases}\r\n        0.5 (x_n - y_n)^2, & \\text{if } |x_n - y_n| < delta \\\\\r\n        delta * (|x_n - y_n| - 0.5 * delta), & \\text{otherwise }\r\n        \\end{cases}",
  "title": null,
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "STDC",
  "full_name": "Short-Term Dense Concatenate",
  "description": "**STDC**, or **Short-Term Dense Concatenate**, is a module for semantic segmentation to extract deep features with scalable\r\nreceptive field and multi-scale information. It aims to remove structure redundancy in the BiSeNet architecture, specifically BiSeNet adds an extra path to encode spatial information which can be time-consuming,. Instead, STDC gradually reduces the dimension of feature maps and use the aggregation of them for image representation.\r\n\r\nWe concatenate response maps from multiple continuous layers, each of which encodes input image/feature in different scales and respective fields, leading to multi-scale feature representation. To speed up, the filter size of layers is gradually reduced with negligible loss in segmentation performance.",
  "title": "Rethinking BiSeNet For Real-time Semantic Segmentation",
  "collection": "Semantic Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "Magnification Prior Contrastive Similarity",
  "full_name": "Magnification Prior Contrastive Similarity",
  "description": "Self-supervised pre-training method to learn efficient representations without labels on histopathology medical images utilizing magnification factors.",
  "title": "Magnification Prior: A Self-Supervised Method for Learning Representations on Breast Cancer Histopathological Images",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "RepVGG",
  "full_name": "RepVGG",
  "description": "**RepVGG** is a [VGG](https://paperswithcode.com/method/vgg)-style convolutional architecture. It has the following advantages:\r\n\r\n- The model has a VGG-like plain (a.k.a. feed-forward) topology 1 without any branches. I.e., every layer takes\r\nthe output of its only preceding layer as input and feeds the output into its only following layer.\r\n- The model’s body uses only 3 × 3 conv and [ReLU](https://paperswithcode.com/method/relu).\r\n- The concrete architecture (including the specific depth and layer widths) is instantiated with no automatic\r\nsearch, manual refinement, compound scaling, nor other heavy designs.",
  "title": "RepVGG: Making VGG-style ConvNets Great Again",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "reSGLD",
  "full_name": "Replica exchange stochastic gradient Langevin Dynamics",
  "description": "reSGLD proposes to simulate a high-temperature particle for exploration and a low-temperature particle for exploitation and allows them to swap simultaneously. Moreover, a correction term is included to avoid biases.",
  "title": "Non-convex Learning via Replica Exchange Stochastic Gradient MCMC",
  "collection": "Markov Chain Monte Carlo",
  "area": "General"
}
{
  "name": "BS-Net",
  "full_name": "BS-Net",
  "description": "**BS-Net** is an architecture for COVID-19 severity prediction based on clinical data from different modalities. The architecture comprises 1) a shared multi-task feature extraction backbone, 2) a lung segmentation branch, 3) an original registration mechanism that acts as a ”multi-resolution feature alignment” block operating on the encoding backbone , and 4) a multi-regional classification part for the final six-valued score estimation. \r\n\r\nAll these blocks act together in the final training thanks to a loss specifically crated for this task. This loss guarantees also performance robustness, comprising a differentiable version of the target discrete metric. The learning phase operates in a weakly-supervised fashion. This is due to the fact that difficulties and pitfalls in the visual interpretation of the disease signs on CXRs (spanning from subtle findings to heavy lung impairment), and the lack of detailed localization information, produces unavoidable inter-rater variability among radiologists in assigning scores.\r\n\r\nSpecifically the architectural details are:\r\n\r\n- The input image is processed with a convolutional backbone; the authors opt for a [ResNet](https://paperswithcode.com/method/resnet)-18.\r\n- Segmentation is performed by a nested version of [U-Net](https://paperswithcode.com/method/u-net) (U-Net++).\r\n- Alignment is estimated through the segmentation probability map produced by the U-Net++ decoder, which is achieved through a [spatial transformer network](https://paperswithcode.com/method/spatial-transformer) -- able to estimate the spatial transform matrix in order to center, rotate, and correctly zoom the lungs. After alignment at various scales, features are forward to a [ROIPool](https://paperswithcode.com/method/roi-pooling). 
\r\n- The alignment block is pre-trained on the synthetic alignment dataset in a weakly-supervised setting, using a Dice loss.\r\n- The scoring head uses [FPNs](https://paperswithcode.com/method/fpn) for the combination of multi-scale feature maps. The multiresolution feature aligner produces input feature maps that are well focused on the specific area of interest. Eventually, the output of the FPN layer flows in a series of convolutional blocks to retrieve the output map. The classification is performed by a final [Global Average Pooling](https://paperswithcode.com/method/global-average-pooling) layer and a [SoftMax](https://paperswithcode.com/method/softmax) activation.\r\n- The Loss function used for training is a sparse categorical cross entropy (SCCE) with a (differentiable) mean absolute error contribution.",
  "title": "BS-Net: learning COVID-19 pneumonia severity on a large Chest X-Ray dataset",
  "collection": "Medical Image Models",
  "area": "Computer Vision"
}
{
  "name": "GPT-NeoX",
  "full_name": "GPT-NeoX",
  "description": "**GPT-NeoX** is an autoregressive transformer decoder model whose architecture largely follows that of GPT-3, with a few notable deviations. The model has 20 billion parameters with 44 layers, a hidden dimension size of 6144, and 64 heads. The main difference with GPT-3 is the change in tokenizer, the addition of Rotary Positional Embeddings, the parallel computation of attention and feed-forward layers, and a different initialization scheme and hyperparameters.",
  "title": "GPT-NeoX-20B: An Open-Source Autoregressive Language Model",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Neural Probabilistic Language Model",
  "full_name": "Neural Probabilistic Language Model",
  "description": "A **Neural Probablistic Language Model** is an early language modelling architecture. It involves a feedforward architecture that takes in input vector representations (i.e. word embeddings) of the previous $n$ words, which are looked up in a table $C$.\r\n\r\nThe word embeddings are concatenated and fed into a hidden layer which then feeds into a [softmax](https://paperswithcode.com/method/softmax) layer to estimate the probability of the word given the context.",
  "title": null,
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "SOHO",
  "full_name": "SOHO",
  "description": "SOHO (“See Out of tHe bOx”) that takes a whole image as input, and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations which enables inference 10 times faster than region-based approaches. Text embeddings are used to extract textual embedding features. A trainable CNN is used to extract visual representations. SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in the proposed pre-training task Masked Visual Modeling (MVM).",
  "title": "Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Feature Selection",
  "full_name": "Feature Selection",
  "description": "Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.",
  "title": "Feature Selection and Feature Extraction in Pattern Analysis: A Literature Review",
  "collection": "AutoML",
  "area": "General"
}
{
  "name": "LiteSeg",
  "full_name": "LiteSeg",
  "description": "**LiteSeg** is a lightweight architecture for semantic segmentation that uses a deeper version of Atrous [Spatial Pyramid Pooling](https://paperswithcode.com/method/spatial-pyramid-pooling) module ([ASPP](https://paperswithcode.com/method/aspp)) and applies short and long residual connections, and [depthwise separable convolution](https://paperswithcode.com/method/depthwise-separable-convolution), resulting in a faster, more efficient model.",
  "title": "LiteSeg: A Novel Lightweight ConvNet for Semantic Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "MoBY",
  "full_name": "MoBY",
  "description": "**MoBY** is a self-supervised learning approach for [Vision Transformers](methods/category/vision-transformer). The approach is basically a combination of [MoCo v2](https://paperswithcode.com/method/moco-v2) and [BYOL](https://paperswithcode.com/method/byol). It inherits the momentum design, the key queue, and the contrastive loss used in MoCo v2, and inherits the asymmetric encoders, asymmetric data augmentations and the momentum scheduler in BYOL. It is named MoBY by picking the first two letters of each method.\r\n\r\nThe MoBY approach is illustrated in the Figure. There are two encoders: an online encoder and a target encoder. Both two encoders consist of a backbone and a projector head ([2-layer MLP](https://paperswithcode.com/method/feedforward-network)), and the online encoder introduces an additional prediction head (2-layer MLP), which makes the two encoders asymmetric. The online encoder is updated by gradients, and the target encoder is a moving average of the online encoder by momentum updating in each training iteration. A gradually increasing momentum updating strategy is applied for on the target encoder: the value of momentum term is gradually increased to 1 during the course of training. The default starting value is $0.99$.\r\n\r\nA contrastive loss is applied to learn the representations. 
Specifically, for an online view $q$, its contrastive loss is computed as\r\n\r\n$$\r\n\\mathcal{L}\\_{q}=-\\log \\frac{\\exp \\left(q \\cdot k\\_{+} / \\tau\\right)}{\\sum\\_{i=0}^{K} \\exp \\left(q \\cdot k\\_{i} / \\tau\\right)}\r\n$$\r\n\r\nwhere $k\\_{+}$ is the target feature for the other view of the same image; $k\\_{i}$ is a target feature in the key queue; $\\tau$ is a temperature term; $K$ is the size of the key queue (4096 by default).\r\n\r\nIn training, like most [Transformer-based methods](https://paperswithcode.com/methods/category/transformers), the [AdamW](https://paperswithcode.com/method/adamw) optimizer is used, in contrast to previous [self-supervised learning approaches](https://paperswithcode.com/methods/category/self-supervised-learning) built on a [ResNet](https://paperswithcode.com/method/resnet) backbone, where [SGD](https://paperswithcode.com/method/sgd-with-momentum) or [LARS](https://paperswithcode.com/method/lars) $[4,8,19]$ is usually used. The authors also use asymmetric [drop path](https://paperswithcode.com/method/droppath) as a regularization method, which proves important for the final performance.\r\n\r\nIn the experiments, the authors adopt a fixed learning rate of $0.001$ and a fixed weight decay of $0.05$, which perform stably well. The tuned hyper-parameters are the key queue size $K$, the starting momentum value of the target branch, the temperature $\\tau$, and the drop path rates.",
  "title": "Self-Supervised Learning with Swin Transformers",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "AdaMod",
  "full_name": "AdaMod",
  "description": "**AdaMod** is a stochastic optimizer that restricts adaptive learning rates with adaptive and momental upper bounds. The dynamic learning rate bounds are based on the exponential moving averages of the adaptive learning rates themselves, which smooth out unexpected large learning rates and stabilize the training of deep neural networks.\r\n\r\n\r\nThe weight updates are performed as:\r\n\r\n\r\n$$ g\\_{t} = \\nabla{f}\\_{t}\\left(\\theta\\_{t-1}\\right) $$\r\n\r\n$$ m\\_{t} = \\beta\\_{1}m\\_{t-1} + \\left(1-\\beta\\_{1}\\right)g\\_{t} $$\r\n\r\n$$ v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2} $$\r\n\r\n$$ \\hat{m}\\_{t} = m\\_{t} / \\left(1 - \\beta^{t}\\_{1}\\right)$$\r\n\r\n$$ \\hat{v}\\_{t} = v\\_{t} / \\left(1 - \\beta^{t}\\_{2}\\right)$$\r\n\r\n$$ \\eta\\_{t} = \\alpha\\_{t} / \\left(\\sqrt{\\hat{v}\\_{t}} + \\epsilon\\right) $$\r\n\r\n$$ s\\_{t} = \\beta\\_{3}s\\_{t-1} + (1-\\beta\\_{3})\\eta\\_{t} $$\r\n\r\n$$ \\hat{\\eta}\\_{t} = \\text{min}\\left(\\eta\\_{t}, s\\_{t}\\right) $$\r\n\r\n$$ \\theta\\_{t} = \\theta\\_{t-1} - \\hat{\\eta}\\_{t}\\hat{m}\\_{t} $$",
  "title": "An Adaptive and Momental Bound Method for Stochastic Learning",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Cross-Covariance Attention",
  "full_name": "Cross-Covariance Attention",
  "description": "**Cross-Covariance Attention**, or **XCA**, is an [attention mechanism](https://paperswithcode.com/methods/category/attention-mechanisms-1) which operates along the feature dimension instead of the token dimension as in [conventional transformers](https://paperswithcode.com/methods/category/transformers).\r\n\r\nUsing the definitions of queries, keys and values from conventional attention, the cross-covariance attention function is defined as:\r\n\r\n$$\r\n\\text { XC-Attention }(Q, K, V)=V \\mathcal{A}_{\\mathrm{XC}}(K, Q), \\quad \\mathcal{A}\\_{\\mathrm{XC}}(K, Q)=\\operatorname{Softmax}\\left(\\hat{K}^{\\top} \\hat{Q} / \\tau\\right)\r\n$$\r\n\r\nwhere each output token embedding is a convex combination of the $d\\_{v}$ features of its corresponding token embedding in $V$. The attention weights $\\mathcal{A}$ are computed based on the cross-covariance matrix.",
  "title": "XCiT: Cross-Covariance Image Transformers",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "VoiceFilter-Lite",
  "full_name": "VoiceFilter-Lite",
  "description": "**VoiceFilter-Lite** is a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. In this architecture, the voice filtering model operates as a frame-by-frame frontend signal processor to enhance the features consumed by the speech recognizer, without reconstructing audio signals from the features. The key contributions are (1) A system to perform speech separation directly on ASR input features; (2) An asymmetric loss function to penalize oversuppression during training, to make the model harmless under various acoustic environments, (3) An adaptive suppression strength mechanism to adapt to different noise conditions.",
  "title": "VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition",
  "collection": "Speech Separation Models",
  "area": "Audio"
}
{
  "name": "GLOW",
  "full_name": "GLOW",
  "description": "**GLOW** is a type of flow-based generative model that is based on an invertible $1 \\times 1$ [convolution](https://paperswithcode.com/method/convolution). This builds on the flows introduced by [NICE](https://paperswithcode.com/method/nice) and [RealNVP](https://paperswithcode.com/method/realnvp). It consists of a series of steps of flow, combined in a multi-scale architecture; see the Figure to the right. Each step of flow consists of Act Normalization followed by an *invertible $1 \\times 1$ convolution* followed by an [affine coupling](https://paperswithcode.com/method/affine-coupling) layer.",
  "title": "Glow: Generative Flow with Invertible 1x1 Convolutions",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "OneR",
  "full_name": "One Representation",
  "description": "In the OneR method, model input can be one of image, text or image+text, and CMC objective is combined with the traditional image-text contrastive (ITC) loss. Masked modeling is also carried out for all three input types (i.e., image, text and multi-modal). This framework employs no modality-specific architectural component except for the initial token embedding layer, making our model generic and modality-agnostic with minimal inductive bias.",
  "title": "Unifying Vision-Language Representation Space with Single-tower Transformer",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Channel & Spatial attention",
  "full_name": "Channel & Spatial attention",
  "description": "Channel & spatial attention combines the advantages of channel attention and spatial attention. It adaptively selects both important objects and regions",
  "title": "Residual Attention Network for Image Classification",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "EEND",
  "full_name": "End-to-End Neural Diarization",
  "description": "**End-to-End Neural Diarization** is a neural network for speaker diarization in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, the speaker diarization problem is formulated as a multi-label classification problem and a permutation-free objective function is introduced to directly minimize diarization errors. The EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, the model can be adapted to real conversations.",
  "title": "End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification",
  "collection": "Speaker Diarization",
  "area": "Audio"
}
{
  "name": "Center Pooling",
  "full_name": "Center Pooling",
  "description": "**Center Pooling** is a pooling technique for object detection that aims to capture richer and more recognizable visual patterns. The geometric centers of objects do not necessarily convey very recognizable visual patterns (e.g., the human head contains strong visual patterns, but the center keypoint is often in the middle of the human body). \r\n\r\nThe detailed process of center pooling is as follows: the backbone outputs a feature map, and to determine if a pixel in the feature map is a center keypoint, we need to find the maximum value in its both horizontal and vertical directions and add them together. By doing this, center pooling helps the better detection of center keypoints.",
  "title": "CenterNet: Keypoint Triplets for Object Detection",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "NADAM",
  "full_name": "NADAM",
  "description": "**NADAM**, or **Nesterov-accelerated Adaptive Moment Estimation**, combines [Adam](https://paperswithcode.com/method/adam) and [Nesterov Momentum](https://paperswithcode.com/method/nesterov-accelerated-gradient). The update rule is of the form:\r\n\r\n$$ \\theta\\_{t+1} = \\theta\\_{t} - \\frac{\\eta}{\\sqrt{\\hat{v}\\_{t}}+\\epsilon}\\left(\\beta\\_{1}\\hat{m}\\_{t} + \\frac{(1-\\beta\\_{t})g\\_{t}}{1-\\beta^{t}\\_{1}}\\right)$$\r\n\r\nImage Source: [Incorporating Nesterov Momentum into Adam](http://cs229.stanford.edu/proj2015/054_report.pdf)",
  "title": null,
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Hydra",
  "full_name": "Hydra",
  "description": "**Hydra** is a multi-headed neural network for model distillation with a shared body network. The shared body network learns a joint feature representation that enables each head to capture the predictive behavior of each ensemble member.  Existing distillation methods often train a distillation network to imitate the prediction of a larger network. Hydra instead learns to distill the individual predictions of each ensemble member into separate light-weight head models while amortizing the computation through a shared heavy-weight body network. This retains the diversity of ensemble member predictions which is otherwise lost in knowledge distillation.",
  "title": "Hydra: Preserving Ensemble Diversity for Model Distillation",
  "collection": "Knowledge Distillation",
  "area": "General"
}
{
  "name": "VTDE",
  "full_name": "Variational Trace Distance Estimation",
  "description": "**Variational Trace Distance Estimation**, or **VTDE**, is a variational algorithm for trace norm estimation that only involves one ancillary qubit. Notably, the cost function in VTDE gathers information from a single-qubit observable and thus could avoid the barren plateau issue with logarithmic depth parameterized circuits.",
  "title": "Variational Quantum Algorithms for Trace Distance and Fidelity Estimation",
  "collection": "Quantum Methods",
  "area": "General"
}
{
  "name": "XLSR",
  "full_name": "XLSR",
  "description": "**XLSR** is a multilingual speech recognition model built on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The model is fine-tuned on labeled data and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. A shared quantization module over feature encoder representations produces multilingual quantized speech units whose embeddings are then used as targets for a [Transformer](https://paperswithcode.com/method/transformer) trained by contrastive learning. The model learns to share discrete tokens across languages, creating bridges across languages.",
  "title": "Unsupervised Cross-lingual Representation Learning for Speech Recognition",
  "collection": "Speech Recognition",
  "area": "Audio"
}
{
  "name": "ULMFiT",
  "full_name": "Universal Language Model Fine-tuning",
  "description": "**Universal Language Model Fine-tuning**, or **ULMFiT**, is an architecture and transfer learning method that can be applied to NLP tasks. It involves a 3-layer [AWD-LSTM](https://paperswithcode.com/method/awd-lstm) architecture for its representations. The training consists of three steps: 1) general language model pre-training on a Wikipedia-based text, 2) fine-tuning the language model on a target task, and 3) fine-tuning the classifier on the target task.\r\n\r\nAs different layers capture different types of information, they are fine-tuned to different extents using [discriminative fine-tuning](https://paperswithcode.com/method/discriminative-fine-tuning). Training is performed using [Slanted triangular learning rates](https://paperswithcode.com/method/slanted-triangular-learning-rates) (STLR), a learning rate scheduling strategy that first linearly increases the learning rate and then linearly decays it.\r\n\r\nFine-tuning the target classifier is achieved in ULMFiT using gradual unfreezing. Rather than fine-tuning all layers at once, which risks catastrophic forgetting, ULMFiT gradually unfreezes the model starting from the last layer (i.e., closest to the output) as this contains the least general knowledge. First the last layer is unfrozen and all unfrozen layers are fine-tuned for one epoch. Then the next group of frozen layers is unfrozen and fine-tuned and repeat, until all layers are fine-tuned until convergence at the last iteration.",
  "title": "Universal Language Model Fine-tuning for Text Classification",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Experience Replay",
  "full_name": "Experience Replay",
  "description": "**Experience Replay** is a replay memory technique used in reinforcement learning where we store the agent’s experiences at each time-step, $e\\_{t} = \\left(s\\_{t}, a\\_{t}, r\\_{t}, s\\_{t+1}\\right)$ in a data-set $D = e\\_{1}, \\cdots, e\\_{N}$ , pooled over many episodes into a replay memory. We then usually sample the memory randomly for a minibatch of experience, and use this to learn off-policy, as with Deep Q-Networks. This tackles the problem of autocorrelation leading to unstable training, by making the problem more like a supervised learning problem.\r\n\r\nImage Credit: [Hands-On Reinforcement Learning with Python, Sudharsan Ravichandiran](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788836524)",
  "title": null,
  "collection": "Replay Memory",
  "area": "Reinforcement Learning"
}
{
  "name": "DeeBERT",
  "full_name": "DeeBERT",
  "description": "**DeeBERT** is a method for accelerating [BERT](https://paperswithcode.com/method/bert) inference. It inserts extra classification layers (which are referred to as off-ramps) between each [transformer](https://paperswithcode.com/method/transformer) layer of BERT. All transformer layers and off-ramps are jointly fine-tuned on a given downstream dataset. At inference time, after a sample goes through a transformer layer, it is passed to the following off-ramp. If the off-ramp is confident of the prediction, the result is returned; otherwise, the sample is sent to the next transformer layer.",
  "title": "DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Heatmap",
  "full_name": "Heatmap",
  "description": "",
  "title": "Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation",
  "collection": "Output Functions",
  "area": "General"
}
{
  "name": "GBlock",
  "full_name": "GBlock",
  "description": "**GBlock** is a type of [residual block](https://paperswithcode.com/method/residual-block) used in the [GAN-TTS](https://paperswithcode.com/method/gan-tts) text-to-speech architecture - it is a stack of two residual blocks. As the generator is producing raw audio (e.g. a 2s training clip corresponds\r\nto a sequence of 48000 samples), dilated convolutions are used to ensure that the receptive field of $G$ is large enough to capture long-term dependencies. The four kernel size-3 convolutions in each GBlock have increasing dilation factors: 1, 2, 4, 8. Convolutions are preceded by Conditional Batch Normalisation, conditioned on the linear embeddings of the noise term $z \\sim N\\left(0, \\mathbf{I}\\_{128}\\right)$ in the single-speaker case, or the concatenation of $z$ and a one-hot representation of the speaker ID in the multi-speaker case. The embeddings are different for\r\neach BatchNorm instance. \r\n\r\nA GBlock contains two skip connections, the first of which in [GAN](https://paperswithcode.com/method/gan)-TTS performs upsampling if the output frequency is higher than the input, and it also contains a size-1 [convolution](https://paperswithcode.com/method/convolution)\r\nif the number of output channels is different from the input.",
  "title": "High Fidelity Speech Synthesis with Adversarial Networks",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "F2DNet",
  "full_name": "Fast Focal Detection Network",
  "description": "F2DNet, a novel two-stage object detection architecture which eliminates redundancy of classical two-stage detectors by replacing the region proposal network with focal detection network and\r\nbounding box head with fast suppression head.",
  "title": "F2DNet: Fast Focal Detection Network for Pedestrian Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "DenseNAS-A",
  "full_name": "DenseNAS-A",
  "description": "**DenseNAS-A** is a mobile convolutional neural network discovered through the [DenseNAS](https://paperswithcode.com/method/densenas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building block is MBConvs, or inverted bottleneck residuals, from the MobileNet architectures.",
  "title": "Densely Connected Search Space for More Flexible Neural Architecture Search",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "LR-Net",
  "full_name": "LR-Net",
  "description": "An **LR-Net** is a type of non-convolutional neural network that utilises local relation layers instead of convolutions for image feature extraction. Otherwise, the architecture follows the same design as a [ResNet](https://paperswithcode.com/method/resnet).",
  "title": null,
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "GNNCL",
  "full_name": "Graph Neural Networks with Continual Learning",
  "description": "Although significant effort has been applied to fact-checking, the prevalence of fake news over social media, which has profound impact on justice, public trust and our society, remains a serious problem. In this work, we focus on propagation-based fake news detection, as recent studies have demonstrated that fake news and real news spread differently online. Specifically, considering the capability of graph neural networks (GNNs) in dealing with non-Euclidean data, we use GNNs to differentiate between the propagation patterns of fake and real news on social media. In particular, we concentrate on two questions: (1) Without relying on any text information, e.g., tweet content, replies and user descriptions, how accurately can GNNs identify fake news? Machine learning models are known to be vulnerable to adversarial attacks, and avoiding the dependence on text-based features can make the model less susceptible to the manipulation of advanced fake news fabricators. (2) How to deal with new, unseen data? In other words, how does a GNN trained on a given dataset perform on a new and potentially vastly different dataset? If it achieves unsatisfactory performance, how do we solve the problem without re-training the model on the entire data from scratch? We study the above questions on two datasets with thousands of labelled news items, and our results show that: (1) GNNs can achieve comparable or superior performance without any text information to state-of-the-art methods. (2) GNNs trained on a given dataset may perform poorly on new, unseen data, and direct incremental training cannot solve the problem---this issue has not been addressed in the previous work that applies GNNs for fake news detection. In order to solve the problem, we propose a method that achieves balanced performance on both existing and new datasets, by using techniques from continual learning to train GNNs incrementally.",
  "title": "Graph Neural Networks with Continual Learning for Fake News Detection from Social Media",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "TLC",
  "full_name": "Test-time Local Converter",
  "description": "TLC convert the global operation to a local one so that it extract representations based on local spatial region of features as in training phase.",
  "title": "Improving Image Restoration by Revisiting Global Information Aggregation",
  "collection": "Image Restoration Models",
  "area": "Computer Vision"
}
{
  "name": "Weight Demodulation",
  "full_name": "Weight Demodulation",
  "description": "**Weight Modulation** is an alternative to [adaptive instance normalization](https://paperswithcode.com/method/adaptive-instance-normalization) for use in generative adversarial networks, specifically it is introduced in [StyleGAN2](https://paperswithcode.com/method/stylegan2). The purpose of [instance normalization](https://paperswithcode.com/method/instance-normalization) is to remove the effect of $s$ - the scales of the features maps - from the statistics of the [convolution](https://paperswithcode.com/method/convolution)’s output feature maps. Weight modulation tries to achieve this goal more directly. Assuming that input activations are i.i.d. random variables with unit standard deviation. After modulation and convolution, the output activations have standard deviation of:\r\n\r\n$$ \\sigma\\_{j} = \\sqrt{{\\sum\\_{i,k}w\\_{ijk}'}^{2}} $$\r\n\r\ni.e., the outputs are scaled by the $L\\_{2}$ norm of the corresponding weights. The subsequent normalization aims to restore the outputs back to unit standard deviation. This can be achieved if we scale (“demodulate”) each output feature map $j$ by $1/\\sigma\\_{j}$ . Alternatively, we can again bake this into the convolution weights:\r\n\r\n$$ w''\\_{ijk} = w'\\_{ijk} / \\sqrt{{\\sum\\_{i, k}w'\\_{ijk}}^{2} + \\epsilon} $$\r\n\r\nwhere $\\epsilon$ is a small constant to avoid numerical issues.",
  "title": "Analyzing and Improving the Image Quality of StyleGAN",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Deep-MAC",
  "full_name": "Deep-MAC",
  "description": "**Deep-MAC**, or **Deep Mask-heads Above CenterNet**, is a type of anchor-free instance segmentation model based on [CenterNet](https://paperswithcode.com/method/centernet).  The motivation for this new architecture is that boxes are much cheaper to annotate than masks, so the authors address the “partially supervised” instance segmentation problem, where all classes have bounding box annotations but only a subset of classes have mask annotations. \r\n\r\nFor predicting bounding boxes, CenterNet outputs 3 tensors: (1) a class-specific [heatmap](https://paperswithcode.com/method/heatmap) which indicates the probability of the center of a bounding box being present at each location, (2) a class-agnostic 2-channel tensor indicating the height and width of the bounding box at each center pixel, and (3) since the output feature map is typically smaller than the image (stride 4 or 8), CenterNet also predicts an x and y direction offset to recover this discretization error at each center pixel.\r\n\r\nFor Deep-MAC, in parallel to the box-related prediction heads, we add a fourth pixel embedding branch $P$. For each bounding box\r\n$b$, we crop a region $P\\_{b}$ from $P$ corresponding to $b$ via [ROIAlign](https://paperswithcode.com/method/roi-align) which results in a 32 × 32 tensor. We then feed each $P\\_{b}$ to a mask-head. The final prediction at the end is a class-agnostic 32 × 32 tensor which we pass through a sigmoid to get per-pixel probabilities. We train this mask-head via a per-pixel cross-entropy loss averaged over all pixels and instances. During post-processing, the predicted mask is re-aligned according to the predicted box and resized to the resolution of the image. \r\n\r\nIn addition to this 32 × 32 cropped feature map, we add two inputs for improved stability of some mask-heads: (1) Instance embedding: an additional head is added to the backbone that predicts a per-pixel embedding. 
For each bounding box $b$ we extract its embedding from the center pixel. This embedding is tiled to a size of 32 × 32 and concatenated to the pixel embedding crop. This helps condition the mask-head on a particular instance and disambiguate it from others. (2) Coordinate Embedding: Inspired by [CoordConv](https://paperswithcode.com/method/coordconv), the authors add a 32 × 32 × 2 tensor holding normalized $\\left(x, y\\right)$ coordinates relative to the bounding box $b$.",
  "title": "The surprising impact of mask-head architecture on novel class segmentation",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "RegNetX",
  "full_name": "RegNetX",
  "description": "**RegNetX** is a convolutional network design space with simple, regular models with parameters: depth $d$, initial width $w\\_{0} > 0$, and slope $w\\_{a} > 0$, and generates a different block width $u\\_{j}$ for each block $j < d$. The key restriction for the RegNet types of model is that there is a linear parameterisation of block widths (the design space only contains models with this linear structure):\r\n\r\n$$ u\\_{j} = w\\_{0} + w\\_{a}\\cdot{j} $$\r\n\r\nFor **RegNetX** we have additional restrictions: we set $b = 1$ (the bottleneck ratio), $12 \\leq d \\leq 28$, and $w\\_{m} \\geq 2$ (the width multiplier).",
  "title": "Designing Network Design Spaces",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Clipped Double Q-learning",
  "full_name": "Clipped Double Q-learning",
  "description": "**Clipped Double Q-learning** is a variant on [Double Q-learning](https://paperswithcode.com/method/double-q-learning) that upper-bounds the less biased Q estimate $Q\\_{\\theta\\_{2}}$ by the biased estimate $Q\\_{\\theta\\_{1}}$. This is equivalent to taking the minimum of the two estimates, resulting in the following target update:\r\n\r\n$$ y\\_{1} = r + \\gamma\\min\\_{i=1,2}Q\\_{\\theta'\\_{i}}\\left(s', \\pi\\_{\\phi\\_{1}}\\left(s'\\right)\\right) $$\r\n\r\nThe motivation for this extension is that vanilla double [Q-learning](https://paperswithcode.com/method/q-learning) is sometimes ineffective if the target and current networks are too similar, e.g. with a slow-changing policy in an actor-critic framework.",
  "title": "Addressing Function Approximation Error in Actor-Critic Methods",
  "collection": "Off-Policy TD Control",
  "area": "Reinforcement Learning"
}
{
  "name": "SNN",
  "full_name": "Spiking Neural Networks",
  "description": "**Spiking Neural Networks** (**SNNs**)  are a class of artificial neural networks inspired by the structure and functioning of the brain's neural networks. Unlike traditional artificial neural networks that operate based on continuous firing rates, SNNs simulate the behavior of individual neurons through discrete spikes or action potentials. These spikes are triggered when the neuron's membrane potential reaches a certain threshold, and they propagate through the network, communicating information and triggering subsequent neuron activations. This spike-based communication allows SNNs to capture the temporal dynamics of information processing and exhibit asynchronous, event-driven behavior, making them well-suited for tasks such as temporal pattern recognition, event detection, and real-time processing. SNNs have gained attention due to their potential in efficiently processing and encoding information, offering advantages in energy efficiency, robustness, and compatibility with neuromorphic hardware architectures.",
  "title": "Self-Normalizing Neural Networks",
  "collection": null,
  "area": null
}
{
  "name": "CTAL",
  "full_name": "CTAL",
  "description": "**CTAL** is a pre-training framework for strong audio-and-language representations with a [Transformer](https://paperswithcode.com/method/transformer), which aims to learn the intra-modality and inter-modalities connections between audio and language through two proxy tasks on a large amount of audio- and-language pairs: masked language modeling and masked cross-modal acoustic modeling. The pre-trained model is a Transformer for Audio and Language, i.e., CTAL, which consists of two modules, a language stream encoding module which adapts word as input element, and a text-referred audio stream encoder module which accepts both frame-level Mel-spectrograms and token-level output embeddings from the language stream",
  "title": "CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "ARCH",
  "full_name": "Animatable Reconstruction of Clothed Humans",
  "description": "**Animatable Reconstruction of Clothed Humans** is an end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image. ARCH is a learned pose-aware model that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator. They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and appearance are learned using an implicit function representation with spatial local features.",
  "title": "ARCH: Animatable Reconstruction of Clothed Humans",
  "collection": "3D Reconstruction",
  "area": "Computer Vision"
}
{
  "name": "ALAE",
  "full_name": "Adversarial Latent Autoencoder",
  "description": "**ALAE**, or **Adversarial Latent Autoencoder**, is a type of autoencoder that attempts to overcome some of the limitations of[ generative adversarial networks](https://paperswithcode.com/paper/generative-adversarial-networks). The architecture allows the latent distribution to be learned from data to address entanglement (A). The output data distribution is learned with an adversarial strategy (B). Thus, we retain the generative properties of GANs, as well as the ability to build on the recent advances in this area. For instance, we can include independent sources of stochasticity, which have proven essential for generating image details, or can leverage recent improvements on GAN loss functions, regularization, and hyperparameters tuning. Finally, to implement (A) and (B), AE reciprocity is imposed in the latent space (C). Therefore, we can avoid using reconstruction losses based on simple $\\mathcal{l}\\){2}$ norm that operates in data space, where they are often suboptimal, like for the image space. Since it works on the latent space, rather than autoencoding the data space, the approach is named Adversarial Latent Autoencoder (ALAE).",
  "title": "Adversarial Latent Autoencoders",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "RotatE",
  "full_name": "RotatE",
  "description": "**RotatE** is a method for generating graph embeddings which is able to model and infer various relation patterns including: symmetry/antisymmetry, inversion, and composition. Specifically, the RotatE model defines each relation as a rotation from the source entity to the target entity in the complex vector space. The RotatE model is trained using a [self-adversarial negative sampling](https://paperswithcode.com/method/self-adversarial-negative-sampling) technique.",
  "title": "RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "RAHP",
  "full_name": "Review-guided Answer Helpfulness Prediction",
  "description": "**Review-guided Answer Helpfulness Prediction** (RAHP) is a textual inference model for identifying helpful answers in e-commerce. It not only considers the interactions between QA pairs, but also investigates the opinion coherence between the answer and crowds' opinions reflected in the reviews, which is another important factor to identify helpful answers.",
  "title": "Review-guided Helpful Answer Identification in E-commerce",
  "collection": "Textual Inference Models",
  "area": "Natural Language Processing"
}
{
  "name": "Dense Block",
  "full_name": "Dense Block",
  "description": "A **Dense Block** is a module used in convolutional neural networks that connects *all layers* (with matching feature-map sizes) directly with each other. It was originally proposed as part of the [DenseNet](https://paperswithcode.com/method/densenet) architecture. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers. In contrast to [ResNets](https://paperswithcode.com/method/resnet), we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them. Hence, the $\\ell^{th}$ layer has $\\ell$ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all $L-\\ell$ subsequent layers. This introduces $\\frac{L(L+1)}{2}$  connections in an $L$-layer network, instead of just $L$, as in traditional architectures: \"dense connectivity\".",
  "title": "Densely Connected Convolutional Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Lookahead",
  "full_name": "Lookahead",
  "description": "**Lookahead** is a type of stochastic optimizer that iteratively updates two sets of weights: \"fast\" and \"slow\". Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of *fast weights* generated by another optimizer.\r\n\r\n\r\n\r\n**Algorithm 1** Lookahead Optimizer\r\n\r\n**Require** Initial parameters $\\phi_0$, objective function $L$ \r\n\r\n**Require** Synchronization period $k$, slow weights step size $\\alpha$, optimizer $A$\r\n\r\n&nbsp;&nbsp;  **for** $t=1, 2, \\dots$\r\n\r\n&nbsp;&nbsp;&nbsp;&nbsp; Synchronize parameters $\\theta_{t,0} \\gets \\phi_{t-1}$\r\n\r\n&nbsp;&nbsp;&nbsp;&nbsp; **for** $i=1, 2, \\dots, k$\r\n\r\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sample minibatch of data $d \\sim \\mathcal{D}$\r\n\r\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $\\theta_{t,i} \\gets \\theta_{t,i-1} + A(L, \\theta_{t,i-1}, d)$\r\n\r\n&nbsp;&nbsp;&nbsp;&nbsp; **endfor**\r\n\r\n&nbsp;&nbsp;&nbsp;&nbsp; Perform outer update $\\phi_t \\gets \\phi_{t-1} + \\alpha (\\theta_{t,k} - \\phi_{t-1})$\r\n\r\n&nbsp;&nbsp; **endfor**\r\n\r\n&nbsp;&nbsp; **return** parameters $\\phi$",
  "title": "Lookahead Optimizer: k steps forward, 1 step back",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Grid R-CNN",
  "full_name": "Grid R-CNN",
  "description": "**Grid R-CNN** is an object detection framework, where the traditional regression\r\nformulation is replaced by a grid point guided localization mechanism.\r\n\r\nGrid R-CNN divides the object bounding box region into grids and employs a fully convolutional network ([FCN](https://paperswithcode.com/method/fcn)) to predict the locations of grid points. Owing to the position sensitive property of fully convolutional architecture, Grid R-CNN maintains the explicit spatial information and grid points locations can be obtained in pixel level. When a certain number of grid points at specified location are known, the corresponding bounding box is definitely determined. Guided by the grid points, Grid R-CNN can determine more accurate object bounding box than regression method which lacks the guidance of explicit spatial information.",
  "title": "Grid R-CNN",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "EsViT",
  "full_name": "EsViT",
  "description": "**EsViT** proposes two techniques for developing efficient self-supervised vision transformers for visual representation leaning: a multi-stage architecture with sparse self-attention and a new pre-training task of region matching. The multi-stage architecture reduces modeling complexity but with a cost of losing the ability to capture fine-grained correspondences between image regions. The new pretraining task allows the model to capture fine-grained region dependencies and as a result significantly improves the quality of the learned vision representations.",
  "title": "Efficient Self-supervised Vision Transformers for Representation Learning",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "PFGM",
  "full_name": "Poisson Flow Generative Models",
  "description": "",
  "title": "Poisson Flow Generative Models",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Re-Attention Module",
  "full_name": "Re-Attention Module",
  "description": "The **Re-Attention Module** is an attention layer used in the [DeepViT](https://paperswithcode.com/method/deepvit) architecture which mixes the attention map with a learnable matrix before multiplying with the values. The motivation is to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The authors note that traditional self-attention fails to learn effective concepts for representation learning in deeper layers of ViT -- attention maps become more similar and less diverse in deeper layers (attention collapse) - and this hinders the model from getting expected performance gain. Re-attention is implemented by:\r\n\r\n$$\r\n\\operatorname{Re}-\\operatorname{Attention}(Q, K, V)=\\operatorname{Norm}\\left(\\Theta^{\\top}\\left(\\operatorname{Softmax}\\left(\\frac{Q K^{\\top}}{\\sqrt{d}}\\right)\\right)\\right) V\r\n$$\r\n\r\nwhere transformation matrix $\\Theta$ is multiplied to the self-attention map $\\textbf{A}$ along the head dimension.",
  "title": "DeepViT: Towards Deeper Vision Transformer",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "CrossTransformers",
  "full_name": "CrossTransformers",
  "description": "CrossTransformers is a Transformer-based neural network architecture which can take a small number of labeled images and an unlabeled query, find coarse spatial correspondence between the query and the labeled images, and then infer class membership by computing distances between spatially-corresponding features.",
  "title": "CrossTransformers: spatially-aware few-shot transfer",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "InPlace-ABN",
  "full_name": "In-Place Activated Batch Normalization",
  "description": "**In-Place Activated Batch Normalization**, or **InPlace-ABN**, substitutes the conventionally used succession of [BatchNorm](https://paperswithcode.com/method/batch-normalization) + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. It approximately halves the memory requirements during training of modern deep learning models.",
  "title": "In-Place Activated BatchNorm for Memory-Optimized Training of DNNs",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "ICA",
  "full_name": "Independent Component Analysis",
  "description": "_**Independent component analysis** (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals._\r\n\r\n_ICA defines a generative model for the observed multivariate data, which is typically given as a large database of samples. In the model, the data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed nongaussian and mutually independent, and they are called the independent components of the observed data. These independent components, also called sources or factors, can be found by ICA._\r\n\r\n_ICA is superficially related to principal component analysis and factor analysis. ICA is a much more powerful technique, however, capable of finding the underlying factors or sources when these classic methods fail completely._\r\n\r\n\r\nExtracted from (https://www.cs.helsinki.fi/u/ahyvarin/whatisica.shtml)\r\n\r\n**Source papers**:\r\n\r\n[Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture](https://doi.org/10.1016/0165-1684(91)90079-X)\r\n\r\n[Independent component analysis, A new concept?](https://doi.org/10.1016/0165-1684(94)90029-9)\r\n\r\n[Independent component analysis: algorithms and applications](https://doi.org/10.1016/S0893-6080(00)00026-5)",
  "title": null,
  "collection": "Dimensionality Reduction",
  "area": "General"
}
{
  "name": "BART",
  "full_name": "BART",
  "description": "**BART** is a [denoising autoencoder](https://paperswithcode.com/method/denoising-autoencoder) for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard [Transformer](https://paperswithcode.com/method/transformer)-based neural machine translation architecture. It uses a standard seq2seq/NMT architecture with a bidirectional encoder (like [BERT](https://paperswithcode.com/method/bert)) and a left-to-right decoder (like [GPT](https://paperswithcode.com/method/gpt)). This means the encoder's attention mask is fully visible, like BERT, and the decoder's attention mask is causal, like [GPT2](https://paperswithcode.com/method/gpt-2).",
  "title": "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "CCAC",
  "full_name": "Confidence Calibration with an Auxiliary Class)",
  "description": "**Confidence Calibration with an Auxiliary Class**, or **CCAC**, is a post-hoc confidence calibration method for DNN classifiers on OOD datasets. The key feature of CCAC is an auxiliary class in the calibration model which separates mis-classified samples from correctly classified ones, thus effectively mitigating the target DNN’s being confidently wrong. It also reduces the number of free parameters in CCAC to reduce free parameters and facilitate transfer to a new unseen dataset.",
  "title": "Calibrating Deep Neural Network Classifiers on Out-of-Distribution Datasets",
  "collection": "Confidence Calibration",
  "area": "General"
}
{
  "name": "1x1 Convolution",
  "full_name": "1x1 Convolution",
  "description": "A **1 x 1 Convolution** is a [convolution](https://paperswithcode.com/method/convolution) with some special properties in that it can be used for dimensionality reduction, efficient low dimensional embeddings, and applying non-linearity after convolutions. It maps an input pixel with all its channels to an output pixel which can be squeezed to a desired output depth. It can be viewed as an [MLP](https://paperswithcode.com/method/feedforward-network) looking at a particular pixel location.\r\n\r\nImage Credit: [http://deeplearning.ai](http://deeplearning.ai)",
  "title": "Network In Network",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "RandomRotate",
  "full_name": "RandomRotate",
  "description": "**RandomRotate** is a type of image data augmentation where we randomly rotate the image by a degree.",
  "title": null,
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "AMSGrad",
  "full_name": "AMSGrad",
  "description": "**AMSGrad** is a stochastic optimization method that seeks to fix a convergence issue with [Adam](https://paperswithcode.com/method/adam) based optimizers. AMSGrad uses the maximum of past squared gradients \r\n$v\\_{t}$ rather than the exponential average to update the parameters:\r\n\r\n$$m\\_{t} = \\beta\\_{1}m\\_{t-1} + \\left(1-\\beta\\_{1}\\right)g\\_{t} $$\r\n\r\n$$v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2}$$\r\n\r\n$$ \\hat{v}\\_{t} = \\max\\left(\\hat{v}\\_{t-1}, v\\_{t}\\right) $$\r\n\r\n$$\\theta\\_{t+1} = \\theta\\_{t} - \\frac{\\eta}{\\sqrt{\\hat{v}_{t}} + \\epsilon}m\\_{t}$$",
  "title": "On the Convergence of Adam and Beyond",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "NON",
  "full_name": "Network On Network",
  "description": "Network On Network (NON) is practical tabular data classification model based on deep neural network to provide accurate predictions. Various deep methods have been proposed and promising progress has been made. However, most of them use operations like neural network and factorization machines to fuse the embeddings of different features directly, and linearly combine the outputs of those operations to get the final prediction. As a result, the intra-field information and the non-linear interactions between those operations (e.g. neural network and factorization machines) are ignored. Intra-field information is the information that features inside each field belong to the same field. NON is proposed to take full advantage of intra-field information and non-linear interactions. It consists of three components: field-wise network at the bottom to capture the intra-field information, across field network in the middle to choose suitable operations data-drivenly, and operation fusion network on the top to fuse outputs of the chosen operations deeply",
  "title": "Network On Network for Tabular Data Classification in Real-world Applications",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "WaveGAN",
  "full_name": "WaveGAN",
  "description": "**WaveGAN** is a generative adversarial network for unsupervised synthesis of raw-waveform audio (as opposed to image-like spectrograms). \r\n\r\nThe WaveGAN architecture is based off [DCGAN](https://paperswithcode.com/method/dcgan). The DCGAN generator uses the [transposed convolution](https://paperswithcode.com/method/transposed-convolution) operation to iteratively upsample low-resolution feature maps into a high-resolution image. WaveGAN modifies this transposed [convolution](https://paperswithcode.com/method/convolution) operation to widen its receptive field, using a longer one-dimensional filters of length 25 instead of two-dimensional filters of size 5x5, and upsampling by a factor of 4 instead of 2 at each layer. The discriminator is modified in a similar way, using length-25 filters in one dimension and increasing stride\r\nfrom 2 to 4. These changes result in WaveGAN having the same number of parameters, numerical\r\noperations, and output dimensionality as DCGAN. An additional layer is added afterwards to allow for more audio samples. Further changes include:\r\n\r\n1. Flattening 2D convolutions into 1D (e.g. 5x5 2D conv becomes length-25 1D).\r\n2. Increasing the stride factor for all convolutions (e.g. stride 2x2 becomes stride 4).\r\n3. Removing [batch normalization](https://paperswithcode.com/method/batch-normalization) from the generator and discriminator.\r\n4. Training using the [WGAN](https://paperswithcode.com/method/wgan)-GP strategy.",
  "title": "Adversarial Audio Synthesis",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "SELU",
  "full_name": "Scaled Exponential Linear Unit",
  "description": "**Scaled Exponential Linear Units**, or **SELUs**, are activation functions that induce self-normalizing properties.\r\n\r\nThe SELU activation function is given by \r\n\r\n$$f\\left(x\\right) = \\lambda{x} \\text{ if } x \\geq{0}$$\r\n$$f\\left(x\\right) = \\lambda{\\alpha\\left(\\exp\\left(x\\right) -1 \\right)} \\text{ if } x < 0 $$\r\n\r\nwith $\\alpha \\approx 1.6733$ and $\\lambda \\approx 1.0507$.",
  "title": "Self-Normalizing Neural Networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "ROME",
  "full_name": "Rank-One Model Editing",
  "description": "",
  "title": "Locating and Editing Factual Associations in GPT",
  "collection": "Parameter Norm Penalties",
  "area": "General"
}
{
  "name": "DCNN",
  "full_name": "Diffusion-Convolutional Neural Networks",
  "description": "Diffusion-convolutional neural networks (DCNN) is a model for graph-structured data. Through the introduction of a diffusion-convolution operation, diffusion-based representations can be learned from graph structured data and used as an effective basis for node classification.\r\n\r\nDescription and image from: [Diffusion-Convolutional Neural Networks](https://arxiv.org/pdf/1511.02136.pdf)",
  "title": "Diffusion-Convolutional Neural Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "G-NIA",
  "full_name": "Generalizable Node Injection Attack",
  "description": "**Generalizable Node Injection Attack**, or **G-NIA**, is an attack scenario for graph neural networks where the attacker injects malicious nodes rather than modifying original nodes or edges to affect the performance of GNNs. G-NIA generates the discrete edges also by Gumbel-Top-𝑘 following OPTI and captures the coupling effect between network structure and node features by a sophisticated designed model. \r\n\r\n G-NIA explicitly models the most critical feature propagation via jointly modeling. Specifically, the malicious attributes are adopted to guide the generation of edges, modeling the influence of attributes and edges. G-NIA also adopts a model-based framework, utilizing useful information of attacking during model training, as well as saving computational cost during inference without re-optimization.",
  "title": "Single Node Injection Attack against Graph Neural Networks",
  "collection": "Adversarial Attacks",
  "area": "General"
}
{
  "name": "DiffPool",
  "full_name": "DiffPool",
  "description": "DiffPool is a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. DiffPool learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer.\r\n\r\nDescription and image from: [Hierarchical Graph Representation Learning with Differentiable Pooling](https://arxiv.org/pdf/1806.08804.pdf)",
  "title": "Hierarchical Graph Representation Learning with Differentiable Pooling",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Residual Connection",
  "full_name": "Residual Connection",
  "description": "**Residual Connections** are a type of skip-connection that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. \r\n\r\nFormally, denoting the desired underlying mapping as $\\mathcal{H}({x})$, we let the stacked nonlinear layers fit another mapping of $\\mathcal{F}({x}):=\\mathcal{H}({x})-{x}$. The original mapping is recast into $\\mathcal{F}({x})+{x}$.\r\n\r\nThe intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.",
  "title": "Deep Residual Learning for Image Recognition",
  "collection": "Skip Connections",
  "area": "General"
}
{
  "name": "Feedback Transformer",
  "full_name": "Feedback Transformer",
  "description": "A **Feedback Transformer** is a type of sequential transformer that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. This feedback nature allows this architecture to perform recursive computation, building stronger representations iteratively upon previous states. To achieve this, the self-attention mechanism of the standard [Transformer](https://paperswithcode.com/method/transformer) is modified so it attends to higher level representations rather than lower ones.",
  "title": "Addressing Some Limitations of Transformers with Feedback Memory",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Shuffle-T",
  "full_name": "Shuffle Transformer",
  "description": "The **Shuffle Transformer Block** consists of the Shuffle Multi-Head Self-Attention module (ShuffleMHSA), the Neighbor-Window Connection module (NWC), and the MLP module. To introduce cross-window connections while maintaining the efficient computation of non-overlapping windows, a strategy which alternates between WMSA and Shuffle-WMSA in consecutive Shuffle Transformer blocks is proposed. The first window-based transformer block uses regular window partition strategy and the second window-based transformer block uses window-based selfattention with spatial shuffle. Besides, the Neighbor-Window Connection moduel (NWC) is added into each block for enhancing connections among neighborhood windows. Thus the proposed shuffle transformer block could build rich cross-window connections and augments representation. Finally, the consecutive Shuffle Transformer blocks are computed as:\r\n\r\n$$ x^{l}=\\mathbf{W M S A}\\left(\\mathbf{B N}\\left(z^{l-1}\\right)\\right)+z^{l-1} $$\r\n\r\n$$ y^{l}=\\mathbf{N W C}\\left(x^{l}\\right)+x^{l} $$\r\n\r\n$$ z^{l}=\\mathbf{M L P}\\left(\\mathbf{B N}\\left(y^{l}\\right)\\right)+y^{l} $$\r\n\r\n$$ x^{l+1}=\\mathbf{S h u f f l e - W M S A}\\left(\\mathbf{B N}\\left(z^{l}\\right)\\right)+z^{l} $$\r\n\r\n$$ y^{l+1}=\\mathbf{N W C}\\left(x^{l+1}\\right)+x^{l+1} $$\r\n\r\n$$ z^{l+1}=\\mathbf{M L P}\\left(\\mathbf{B N}\\left(y^{l+1}\\right)\\right)+y^{l+1} $$\r\n\r\nwhere $x^l$, $y^l$ and $z^l$ denote the output features of the (Shuffle-)WMSA module, the Neighbor-Window Connection module and the MLP module for block $l$, respectively; WMSA and Shuffle-WMSA denote\r\nwindow-based multi-head self-attention without/with spatial shuffle, respectively.",
  "title": "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Temporal Activation Regularization",
  "full_name": "Temporal Activation Regularization",
  "description": "**Temporal Activation Regularization (TAR)** is a type of slowness regularization for [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks) that penalizes differences between states that have been explored in the past. Formally we minimize:\r\n\r\n$$\\beta{L\\_{2}}\\left(h\\_{t} - h\\_{t+1}\\right)$$\r\n\r\nwhere $L\\_{2}$ is the $L\\_{2}$ norm, $h_{t}$ is the output of the RNN at timestep $t$, and $\\beta$ is a scaling coefficient.",
  "title": "Revisiting Activation Regularization for Language RNNs",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "MTS",
  "full_name": "Matching The Statements",
  "description": "",
  "title": "Matching The Statements: A Simple and Accurate Model for Key Point Analysis",
  "collection": "Sentence Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "Soft-NMS",
  "full_name": "Soft-NMS",
  "description": "Non-maximum suppression is an integral part of the object detection pipeline. First, it sorts all detection boxes on the basis of their scores. The detection box $M$ with the maximum score is selected and all other detection boxes with a significant overlap (using a pre-defined threshold)\r\nwith $M$ are suppressed. This process is recursively applied on the remaining boxes. As per the design of the algorithm, if an object lies within the predefined overlap threshold, it leads to a miss. \r\n\r\n**Soft-NMS** solves this problem by decaying the detection scores of all other objects as a continuous function of their overlap with M. Hence, no object is eliminated in this process.",
  "title": "Soft-NMS -- Improving Object Detection With One Line of Code",
  "collection": "Proposal Filtering",
  "area": "Computer Vision"
}
{
  "name": "CoordConv",
  "full_name": "CoordConv",
  "description": "A **CoordConv** layer is a simple extension to the standard convolutional layer. It has the same functional signature as a convolutional layer, but accomplishes the mapping by first concatenating extra channels to the incoming representation. These channels contain hard-coded coordinates, the most basic version of which is one channel for the $i$ coordinate and one for the $j$ coordinate.\r\n\r\nThe CoordConv layer keeps the properties of few parameters and efficient computation from convolutions, but allows the network to learn to keep or to discard translation invariance as is needed for the task being learned. This is useful for coordinate transform based tasks where regular convolutions can fail.",
  "title": "An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "HTC",
  "full_name": "Hybrid Task Cascade",
  "description": "**Hybrid Task Cascade**, or **HTC**, is a framework for cascading in instance segmentation. It differs from [Cascade Mask R-CNN](https://paperswithcode.com/method/cascade-mask-r-cnn) in two important aspects:  (1) instead of performing cascaded refinement on the two tasks of detection and segmentation separately, it interweaves them for a joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguishing hard foreground from cluttered background.",
  "title": "Hybrid Task Cascade for Instance Segmentation",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Hard Sigmoid",
  "full_name": "Hard Sigmoid",
  "description": "The **Hard Sigmoid** is an activation function used for neural networks of the form:\r\n\r\n$$f\\left(x\\right) = \\max\\left(0, \\min\\left(1,\\frac{\\left(x+1\\right)}{2}\\right)\\right)$$\r\n\r\nImage Source: [Rinat Maksutov](https://towardsdatascience.com/deep-study-of-a-not-very-deep-neural-network-part-2-activation-functions-fd9bd8d406fc)",
  "title": "BinaryConnect: Training Deep Neural Networks with binary weights during propagations",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Variational Entanglement Detection",
  "full_name": "Variational Entanglement Detection",
  "description": "**Variational Entanglement Detection** is a variational quantum algorithm which uses criteria based on positive maps as a bridge and works as follows. Given an unknown target bipartite quantum state, it firstly decomposes the chosen positive map into a linear combination of NISQ implementable quantum operations. Then, it variationally estimates the minimal eigenvalue of the final state, obtained by executing these quantum operations on the target state and averaging the output states. Deterministic and probabilistic methods are proposed to compute the average. At last, it asserts that the target state is entangled if the optimized minimal eigenvalue is negative. VLNE builds upon a linear decomposition of the transpose map into Pauli terms and the recently proposed trace distance estimation algorithm. It variationally estimates the well-known logarithmic negativity entanglement measure and could be applied to quantify entanglement on near-term quantum devices.",
  "title": "Detecting and quantifying entanglement on near-term quantum devices",
  "collection": "Quantum Methods",
  "area": "General"
}
{
  "name": "Linear Warmup",
  "full_name": "Linear Warmup",
  "description": "**Linear Warmup** is a learning rate schedule where we linearly increase the learning rate from a low rate to a constant rate thereafter. This reduces volatility in the early stages of training.\r\n\r\nImage Credit: [Chengwei Zhang](https://www.dlology.com/about-me/)",
  "title": null,
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "Class Attention",
  "full_name": "Class Attention",
  "description": "A **Class Attention** layer, or **CA Layer**, is an [attention mechanism](https://paperswithcode.com/methods/category/attention-mechanisms-1) for [vision transformers](https://paperswithcode.com/methods/category/vision-transformer) used in [CaiT](https://paperswithcode.com/method/cait) that aims to extract information from a set of processed patches. It is identical to a [self-attention layer](https://paperswithcode.com/method/scaled), except that it relies on the attention between (i) the class embedding $x_{\\text {class }}$ (initialized at CLS in the first CA) and (ii) itself plus the set of frozen patch embeddings $x_{\\text {patches }} .$ \r\n\r\nConsidering a network with $h$ heads and $p$ patches, and denoting by $d$ the embedding size, the multi-head class-attention is parameterized with several projection matrices, $W_{q}, W_{k}, W_{v}, W_{o} \\in \\mathbf{R}^{d \\times d}$, and the corresponding biases $b_{q}, b_{k}, b_{v}, b_{o} \\in \\mathbf{R}^{d} .$ With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z=\\left[x_{\\text {class }}, x_{\\text {patches }}\\right]$. We then perform the projections:\r\n\r\n$$Q=W\\_{q} x\\_{\\text {class }}+b\\_{q}$$\r\n\r\n$$K=W\\_{k} z+b\\_{k}$$\r\n\r\n$$V=W\\_{v} z+b\\_{v}$$\r\n\r\nThe class-attention weights are given by\r\n\r\n$$\r\nA=\\operatorname{Softmax}\\left(Q . K^{T} / \\sqrt{d / h}\\right)\r\n$$\r\n\r\nwhere $Q . K^{T} \\in \\mathbf{R}^{h \\times 1 \\times p}$. This attention is involved in the weighted sum $A \\times V$ to produce the residual output vector\r\n\r\n$$\r\n\\operatorname{out}\\_{\\mathrm{CA}}=W\\_{o} A V+b\\_{o}\r\n$$\r\n\r\nwhich is in turn added to $x\\_{\\text {class }}$ for subsequent processing.",
  "title": "Going deeper with Image Transformers",
  "collection": "Attention",
  "area": "General"
}
{
  "name": "PELU",
  "full_name": "Parametric Exponential Linear Unit",
  "description": "**Parameterized Exponential Linear Units**, or **PELU**, is an activation function for neural networks. It involves learning a parameterization of [ELU](https://paperswithcode.com/method/elu) in order to learn the proper activation shape at each layer in a CNN. \r\n\r\nThe PELU has two additional parameters over the ELU:\r\n\r\n$$ f\\left(x\\right) = cx \\text{ if } x > 0 $$\r\n$$ f\\left(x\\right) = \\alpha\\exp^{\\frac{x}{b}} - 1 \\text{ if } x \\leq 0 $$\r\n\r\nWhere $a$, $b$, and $c > 0$. Here $c$ causes a change in the slope in the positive quadrant, $b$ controls the scale of the [exponential decay](https://paperswithcode.com/method/exponential-decay), and $\\alpha$ controls the saturation in the negative quadrant.\r\n\r\nSource: [Activation Functions](https://arxiv.org/pdf/1811.03378.pdf)",
  "title": "Parametric Exponential Linear Unit for Deep Convolutional Neural Networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "CapsNet",
  "full_name": "Capsule Network",
  "description": "A **Capsule Network** is a type of artificial neural network designed to better model hierarchical relationships. The approach is an attempt to more closely mimic biological neural organization.",
  "title": "Dynamic Routing Between Capsules",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "AHAF",
  "full_name": "Adaptive Hybrid Activation Function",
  "description": "**AHAF**, or **Adaptive Hybrid Activation Function**, is a trainable activation function formulated as a sigmoid-based generalization of ReLU, Swish and SiLU.",
  "title": "Adaptive hybrid activation function for deep neural networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "ZeRO-Infinity",
  "full_name": "ZeRO-Infinity",
  "description": "**ZeRO-Infinity** is a sharded data parallel system that extends [ZeRO](https://paperswithcode.com/method/zero) with an innovation in heterogeneous memory access called the infinity offload engine. This allows ZeRO-Infinity to support massive model sizes on limited GPU resources by exploiting CPU and NVMe memory simultaneously. In addition, ZeRO-Infinity introduces a novel GPU memory optimization technique called memory-centric tiling to support extremely large individual layers that would otherwise not fit in GPU memory even one layer at a time.",
  "title": "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "Partition Filter Network",
  "full_name": "Partition Filter Network",
  "description": "**Partition Filter Network** is a framework designed specifically for joint entity and relation extraction. The framework consists of three components: a partition filter encoder, an NER unit and an RE unit. The task units use table-filling for word pair prediction. At each time step, the partition filter encoder decomposes feature encoding into two steps: partition and filter. In partition, neurons are first segmented into two task partitions and one shared partition. Then, in filter, partitions are selected and combined to form task-specific features and shared features, filtering out information irrelevant to each task.",
  "title": "A Partition Filter Network for Joint Entity and Relation Extraction",
  "collection": "Relation Extraction Models",
  "area": "Natural Language Processing"
}
{
  "name": "Assemble-ResNet",
  "full_name": "Assemble-ResNet",
  "description": "**Assemble-ResNet** is a modification to the [ResNet](https://paperswithcode.com/method/resnet) architecture with several tweaks including using [ResNet-D](https://paperswithcode.com/method/resnet-d), channel attention, [anti-alias downsampling](https://paperswithcode.com/method/anti-alias-downsampling), and Big Little Networks.",
  "title": "Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Recurrent Entity Network",
  "full_name": "Recurrent Entity Network",
  "description": "The **Recurrent Entity Network** is equipped with a dynamic long-term memory which allows it to maintain and update a representation of the state of the world as it receives new data. For language understanding tasks, it can reason on-the-fly as it reads text, not just when it is required to answer a question or respond, as is the case for a [Memory Network](https://paperswithcode.com/method/memory-network). Like a [Neural Turing Machine](https://paperswithcode.com/method/neural-turing-machine) or Differentiable Neural Computer, it maintains a fixed-size memory and can learn to perform location and content-based read and write operations. However, unlike those models, it has a simple parallel architecture in which several memory locations can be updated simultaneously. \r\n\r\nThe model consists of a fixed number of dynamic memory cells, each containing a vector key $w_j$ and a vector value (or content) $h_j$. Each cell is associated with its own processor, a simple gated recurrent network that may update the cell value given an input. If each cell learns to represent a concept or entity in the world, one can imagine a gating mechanism that, based on the key and content of the memory cells, will only modify the cells that concern the entities mentioned in the input. There is no direct interaction between the memory cells, hence the system can be seen as multiple identical processors functioning in parallel, with distributed local memory. \r\n\r\nThe parameters are shared across these processors, which reflects an invariance of the underlying laws across object instances, similarly to how the [weight tying](https://paperswithcode.com/method/weight-tying) scheme in a CNN reflects an invariance of image statistics across locations. The hidden state of each cell is updated only when new information relevant to its concept is received, and remains otherwise unchanged. The keys used in the addressing/gating mechanism also correspond to concepts or entities, but are modified only during learning, not during inference.",
  "title": "Tracking the World State with Recurrent Entity Networks",
  "collection": "Working Memory Models",
  "area": "General"
}
{
  "name": "Maxout",
  "full_name": "Maxout",
  "description": "The **Maxout Unit** is a generalization of the [ReLU](https://paperswithcode.com/method/relu) and the [leaky ReLU](https://paperswithcode.com/method/leaky-relu) functions. It is a piecewise linear function that returns the maximum of the inputs, designed to be used in conjunction with [dropout](https://paperswithcode.com/method/dropout). Both ReLU and leaky ReLU are special cases of Maxout. \r\n\r\n$$f\\left(x\\right) = \\max\\left(w^{T}\\_{1}x + b\\_{1}, w^{T}\\_{2}x + b\\_{2}\\right)$$\r\n\r\nThe main drawback of Maxout is that it is computationally expensive as it doubles the number of parameters for each neuron.",
  "title": "Maxout Networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "ProGAN",
  "full_name": "Progressively Growing GAN",
  "description": "**ProGAN**, or **Progressively Growing GAN**, is a generative adversarial network that utilises a progressively growing training approach. The idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses.",
  "title": "Progressive Growing of GANs for Improved Quality, Stability, and Variation",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "ALiBi",
  "full_name": "Attention with Linear Biases",
  "description": "**ALiBi**, or **Attention with Linear Biases**, is a [positioning method](https://paperswithcode.com/methods/category/position-embeddings) that allows [Transformer](https://paperswithcode.com/methods/category/transformers) language models to consume, at inference time, sequences which are longer than the ones they were trained on. \r\n\r\nALiBi does this without using actual position embeddings. Instead, when computing the attention between a key and a query, ALiBi penalizes the attention value that the query can assign to the key depending on how far apart the key and query are. When a key and query are close together, the penalty is very low; when they are far apart, the penalty is very high. \r\n\r\nThis method was motivated by the simple reasoning that words that are close by matter much more than ones that are far away.\r\n\r\nThis method is as fast as the sinusoidal or absolute embedding methods (the fastest positioning methods there are). It outperforms those methods and Rotary embeddings when evaluating sequences that are longer than the ones the model was trained on (this is known as extrapolation).",
  "title": "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation",
  "collection": "Inference Extrapolation",
  "area": "Natural Language Processing"
}
{
  "name": "Decorrelated Batch Normalization",
  "full_name": "Decorrelated Batch Normalization",
  "description": "**Decorrelated Batch Normalization (DBN)** is a normalization technique which not only centers and scales activations but also whitens them. ZCA whitening is employed instead of [PCA](https://paperswithcode.com/method/pca) whitening, since PCA whitening causes a problem called *stochastic axis swapping*, which is detrimental to learning.",
  "title": "Decorrelated Batch Normalization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "FFMv1",
  "full_name": "Feature Fusion Module v1",
  "description": "**Feature Fusion Module v1** is a feature fusion module from the [M2Det](https://paperswithcode.com/method/m2det) object detection model. Feature fusion modules are crucial for constructing the final multi-level feature pyramid: they use [1x1 convolution](https://paperswithcode.com/method/1x1-convolution) layers to compress the channels of the input features and a concatenation operation to aggregate the resulting feature maps. FFMv1 takes two feature maps with different scales from the backbone as input and applies one upsample operation to rescale the deep features to the same scale before the concatenation operation.",
  "title": "M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "OpenPose",
  "full_name": "OpenPose",
  "description": "",
  "title": "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields",
  "collection": "Pose Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "ProxyAnchorLoss",
  "full_name": "Proxy Anchor Loss for Deep Metric Learning",
  "description": "",
  "title": "Proxy Anchor Loss for Deep Metric Learning",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Squared ReLU",
  "full_name": "Squared ReLU",
  "description": "**Squared ReLU** is an activation function used in the [Primer](https://paperswithcode.com/method/primer) architecture in the feedforward block of the [Transformer](https://paperswithcode.com/methods/category/transformers) layer. It is simply squared [ReLU](https://paperswithcode.com/method/relu) activations.\r\n\r\nThe effectiveness of higher order polynomials can also be observed in other effective [Transformer](https://paperswithcode.com/method/transformer) nonlinearities, such as [GLU](https://paperswithcode.com/method/glu) variants like [ReGLU](https://paperswithcode.com/method/reglu) and point-wise activations like [approximate GELU](https://paperswithcode.com/method/gelu). However, squared ReLU has drastically different asymptotics as $x \\rightarrow \\infty$ compared to the most commonly used activation functions: [ReLU](https://paperswithcode.com/method/relu), [GELU](https://paperswithcode.com/method/gelu) and [Swish](https://paperswithcode.com/method/swish). Squared ReLU does have significant overlap with ReGLU and in fact is equivalent when ReGLU’s $U$ and $V$ weight matrices are the same and squared ReLU is immediately preceded by a linear transformation with weight matrix $U$. This leads the authors to believe that squared ReLUs capture the benefits of these GLU variants, while being simpler, without additional parameters, and delivering better quality.",
  "title": "Primer: Searching for Efficient Transformers for Language Modeling",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "GRU",
  "full_name": "Gated Recurrent Unit",
  "description": "A **Gated Recurrent Unit**, or **GRU**, is a type of recurrent neural network. It is similar to an [LSTM](https://paperswithcode.com/method/lstm), but has only two gates (a reset gate and an update gate) and notably lacks an output gate. Having fewer parameters means GRUs are generally easier and faster to train than their LSTM counterparts.\r\n\r\nImage Source: [here](https://commons.wikimedia.org/wiki/File:Gated_Recurrent_Unit,_type_1.svg)",
  "title": "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Transformer-XL",
  "full_name": "Transformer-XL",
  "description": "**Transformer-XL** (meaning extra long) is a [Transformer](https://paperswithcode.com/method/transformer) architecture that introduces the notion of recurrence to the deep self-attention network. Instead of computing the hidden states from scratch for each new segment, Transformer-XL reuses the hidden states obtained in previous segments. The reused hidden states serve as memory for the current segment, which builds up a recurrent connection between the segments. As a result, modeling very long-term dependency becomes possible because information can be propagated through the recurrent connections. As an additional contribution, the Transformer-XL uses a new relative positional encoding formulation that generalizes to attention lengths longer than the one observed during training.",
  "title": "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "MultiGrain",
  "full_name": "MultiGrain",
  "description": "**MultiGrain** is a type of image model that learns a single embedding for classes, instances and copies.  In other words, it is a convolutional neural network that is suitable for both image classification and instance retrieval. We learn MultiGrain by jointly training an image embedding for multiple tasks. The resulting representation is compact and can outperform narrowly-trained embeddings. The learned embedding output incorporates different levels of granularity.",
  "title": "MultiGrain: a unified image embedding for classes and instances",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "ReLU6",
  "full_name": "ReLU6",
  "description": "**ReLU6** is a modification of the [rectified linear unit](https://paperswithcode.com/method/relu) where we limit the activation to a maximum size of $6$. This is due to increased robustness when used with low-precision computation.\r\n\r\nImage Credit: [PyTorch](https://pytorch.org/docs/master/generated/torch.nn.ReLU6.html)",
  "title": "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "UFLoss",
  "full_name": "Unsupervised Feature Loss",
  "description": "**UFLoss**, or **Unsupervised Feature Loss**, is a patch-based unsupervised learned feature loss for deep learning (DL) based reconstructions. The UFLoss provides instance-level discrimination by mapping similar instances to similar low-dimensional feature vectors using a pre-trained mapping network (UFLoss Network). The rationale of using features from large-patches (typically 40×40 pixels for a 300×300 pixels image) is that we want the UFLoss to capture mid-level structural and semantic features instead of using small patches (typically around 10×10 pixels), which only contain local edge information. On the other hand, the authors avoid using global features due to the fact that the training set (typically around 5000 slices) is usually not large enough to capture common and general features at a large-image scale.",
  "title": "High Fidelity Deep Learning-based MRI Reconstruction with Instance-wise Discriminative Feature Matching Loss",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "PIoU Loss",
  "full_name": "PIoU Loss",
  "description": "**PIoU Loss** is a loss function for oriented object detection which is formulated to exploit both the angle and IoU for accurate oriented bounding box regression. The PIoU loss is derived from IoU metric with a pixel-wise form.",
  "title": "PIoU Loss: Towards Accurate Oriented Object Detection in Complex Environments",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Grab",
  "full_name": "Grab",
  "description": "**Grab** is a sensor processing system for cashier-free shopping. Grab needs to accurately identify and track customers, and associate each shopper with items he or she retrieves from shelves. To do this, it uses a keypoint-based pose tracker as a building block for identification and tracking, develops robust feature-based face trackers, and algorithms for associating and tracking arm movements. It also uses a probabilistic framework to fuse readings from camera, weight and RFID sensors in order to accurately assess which shopper picks up which item.",
  "title": "Grab: Fast and Accurate Sensor Processing for Cashier-Free Shopping",
  "collection": "Cashier-Free Shopping",
  "area": "Computer Vision"
}
{
  "name": "Multiple Random Window Discriminator",
  "full_name": "Multiple Random Window Discriminator",
  "description": "**Multiple Random Window Discriminator** is a discriminator used for the [GAN-TTS](https://paperswithcode.com/method/gan-tts) text-to-speech architecture. These discriminators operate on randomly sub-sampled fragments of the real or generated samples. The ensemble allows for the evaluation of audio in different complementary ways, and is obtained by taking a Cartesian product of two parameter spaces: (i) the size of the random windows fed into the discriminator; (ii) whether a discriminator is conditioned on linguistic and pitch features. For example, in the authors' best-performing model, they consider five window sizes (240, 480, 960, 1920, 3600 samples), which yields 10 discriminators in total. \r\n\r\nUsing random windows of different sizes, rather than the full generated sample, has a data augmentation effect and also reduces the computational complexity of the random window discriminators (RWDs). In the first layer of each discriminator, the MRWD reshapes (downsamples) the input raw waveform to a constant temporal dimension $\\omega = 240$ by moving consecutive blocks of samples into the channel dimension, i.e. from $\\left[\\omega k, 1\\right]$ to $\\left[\\omega, k\\right]$, where $k$ is the downsampling factor (e.g. $k = 8$ for input window size $1920$). This way, all the RWDs have the same architecture and similar computational complexity despite different window sizes. \r\n\r\nThe conditional discriminators have access to linguistic and pitch features, and can measure whether the generated audio matches the input conditioning. This means that random windows in conditional discriminators need to be aligned with the conditioning frequency to preserve the correspondence between the waveform and linguistic features within the sampled window. This limits the valid sampling to that of the frequency of the conditioning signal (200Hz, or every 5ms). The unconditional discriminators, on the contrary, only evaluate whether the generated audio sounds realistic regardless of the conditioning. The random windows for these discriminators are sampled without constraints at full 24kHz frequency, which further increases the amount of training data. \r\n\r\nFor the architecture, the discriminators consist of blocks (DBlocks) that are similar to the [GBlocks](https://paperswithcode.com/method/gblock) used in the generator, but without batch normalisation. Unconditional RWDs are composed entirely of DBlocks. In conditional RWDs, the input waveform is gradually downsampled by DBlocks, until the temporal dimension of the activation is equal to that of the conditioning, at which point a conditional [DBlock](https://paperswithcode.com/method/dblock) is used. This joint information is then passed to the remaining DBlocks, whose final output is average-pooled to obtain a scalar. The dilation factors in the DBlocks’ convolutions follow the pattern 1, 2, 1, 2; unlike the generator, the discriminator operates on a relatively small window, and the authors did not observe any benefit from using larger dilation factors.",
  "title": "High Fidelity Speech Synthesis with Adversarial Networks",
  "collection": "Discriminators",
  "area": "General"
}
{
  "name": "EDLPS",
  "full_name": "Encoder-Decoder model with local and pairwise loss along with shared encoder and discriminator network (EDLPS)",
  "description": "In this paper, we propose a novel method for obtaining sentence-level embeddings, whereas the problem of obtaining word-level embeddings is already very well studied. The embeddings are obtained by a simple method in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrases, we would like the generated paraphrase to be semantically close to the original sentence. One way to ensure this is by adding constraints for true paraphrase embeddings to be close and unrelated paraphrase candidate sentence embeddings to be far. This is ensured by using a sequential pair-wise discriminator that shares weights with the encoder. This discriminator is trained with a suitable loss function. Our loss function penalizes paraphrase sentence embedding distances from being too large. This loss is used in combination with a sequential encoder-decoder network. We also validate our method by evaluating the obtained embeddings for a sentiment analysis task. The proposed method results in semantic embeddings and provides competitive results on the paraphrase generation and sentiment analysis tasks on standard datasets. These results are also shown to be statistically significant.\r\n\r\nCode: https://github.com/dev-chauhan/PQG-pytorch\r\n\r\nThe PQG dataset is available at https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs (see also https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs).\r\n\r\nWebsite: www.kaggle.com/c/sentiment-analysis-on-movie-reviews",
  "title": null,
  "collection": "Document Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "Herring",
  "full_name": "Herring",
  "description": "**Herring** is a parameter server based distributed training method. It combines AWS's Elastic Fabric Adapter (EFA) with a novel parameter sharding technique that makes better use of the available network bandwidth. Herring uses EFA and a balanced fusion buffer to optimally use the total bandwidth available across all nodes in the cluster. Herring reduces gradients hierarchically, reducing them inside the node first and then reducing across nodes. This enables more efficient use of PCIe bandwidth in the node and helps keep the gradient-averaging burden on the GPU low.",
  "title": null,
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "GPS",
  "full_name": "Greedy Policy Search",
  "description": "**Greedy Policy Search** (GPS) is a simple algorithm that learns a policy for test-time data augmentation based on the predictive performance on a validation set. GPS starts with an empty policy and builds it in an iterative fashion. Each step selects a sub-policy that provides the largest improvement in calibrated log-likelihood of ensemble predictions and adds it to the current policy.",
  "title": "Greedy Policy Search: A Simple Baseline for Learnable Test-Time Augmentation",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "POTO",
  "full_name": "Prediction-aware One-To-One",
  "description": "**Prediction-aware One-To-One**, or **POTO**, is an assignment rule for object detection which dynamically assigns the foreground samples according to the quality of classification and regression simultaneously.",
  "title": "End-to-End Object Detection with Fully Convolutional Network",
  "collection": "Detection Assignment Rules",
  "area": "Computer Vision"
}
{
  "name": "TaBERT",
  "full_name": "TaBERT",
  "description": "**TaBERT** is a pretrained language model (LM) that jointly learns representations for natural language sentences and (semi-)structured tables. TaBERT is trained on a large corpus of 26 million tables and their English contexts. \r\n\r\nIn summary, TaBERT's process for learning representations for NL sentences is as follows: Given an utterance $u$ and a table $T$, TaBERT first creates a content snapshot of $T$. This snapshot consists of sampled rows that summarize the information in $T$ most relevant to the input utterance. The model then linearizes each row in the snapshot, concatenates each linearized row with the utterance, and uses the concatenated string as input to a Transformer model, which outputs row-wise encoding vectors of utterance tokens and cells. The encodings for all the rows in the snapshot are fed into a series of vertical self-attention layers, where a cell representation (or an utterance token representation) is computed by attending to vertically-aligned vectors of the same column (or the same NL token). Finally, representations for each utterance token and column are generated from a pooling layer.",
  "title": "TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "Scatter Connection",
  "full_name": "Scatter Connection",
  "description": "A **Scatter Connection** is a type of connection that allows a vector to be \"scattered\" onto a layer representing a map, so that a vector at a specific location corresponds to objects of interest at that location (e.g. units in Starcraft II). This allows for the integration of spatial and non-spatial features.",
  "title": null,
  "collection": "Miscellaneous Components",
  "area": "General"
}
{
  "name": "Sigmoid Activation",
  "full_name": "Sigmoid Activation",
  "description": "**Sigmoid Activations** are a type of activation function for neural networks:\r\n\r\n$$f\\left(x\\right) = \\frac{1}{\\left(1+\\exp\\left(-x\\right)\\right)}$$\r\n\r\nSome drawbacks of this activation that have been noted in the literature are: sharply damped gradients during backpropagation from deeper hidden layers to inputs, gradient saturation, and slow convergence.",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "TTUR",
  "full_name": "Two Time-scale Update Rule",
  "description": "The **Two Time-scale Update Rule (TTUR)** is an update rule for generative adversarial networks trained with stochastic gradient descent. TTUR uses separate learning rates for the discriminator and the generator. The main premise is that the discriminator converges to a local minimum when the generator is fixed. If the generator changes slowly enough, then the discriminator still converges, since the generator perturbations are small. Besides ensuring convergence, the performance may also improve since the discriminator must first learn new patterns before they are transferred to the generator. In contrast, a generator which is overly fast drives the discriminator steadily into new regions without capturing its gathered information.",
  "title": "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "Snapshot Ensembles",
  "full_name": "Snapshot Ensembles: Train 1, get M for free",
  "description": "The overhead cost of training multiple deep neural networks can be very high in terms of training time, hardware, and computational resource requirements, and often acts as an obstacle to creating deep ensembles. To overcome these barriers, Huang et al. proposed a method to create an ensemble which, at the cost of training one model, yields multiple constituent model snapshots that can be ensembled together to create a strong learner. The core idea is to make the model converge to several local minima along the optimization path and save the model parameters at these local minima. During the training phase, a neural network traverses many such points. The lowest of all such local minima is known as the global minimum. The larger the model, the greater the number of parameters and the larger the number of local minima. This implies that there are discrete sets of weights and biases at which the model makes fewer errors. So, every such minimum can be considered a weak but potential learner model for the problem being solved. Multiple such snapshots of weights and biases are recorded, which can later be ensembled to get a better generalized model which makes the fewest mistakes.",
  "title": "Snapshot Ensembles: Train 1, get M for free",
  "collection": "Active Learning",
  "area": "General"
}
{
  "name": "Attention Free Transformer",
  "full_name": "Attention Free Transformer",
  "description": "**Attention Free Transformer**, or **AFT**, is an efficient variant of a [multi-head attention module](https://paperswithcode.com/method/multi-head-attention) that eschews [dot product self attention](https://paperswithcode.com/method/scaled). In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible with both large input and model sizes.\r\n\r\nGiven the input $X$, AFT first linearly transforms it into $Q=X W^{Q}, K=X W^{K}, V=X W^{V}$, then performs the following operation:\r\n\r\n$$\r\nY=f(X) ; Y\\_{t}=\\sigma\\_{q}\\left(Q\\_{t}\\right) \\odot \\frac{\\sum\\_{t^{\\prime}=1}^{T} \\exp \\left(K\\_{t^{\\prime}}+w\\_{t, t^{\\prime}}\\right) \\odot V\\_{t^{\\prime}}}{\\sum\\_{t^{\\prime}=1}^{T} \\exp \\left(K\\_{t^{\\prime}}+w\\_{t, t^{\\prime}}\\right)}\r\n$$\r\n\r\nwhere $\\odot$ is the element-wise product; $\\sigma\\_{q}$ is the nonlinearity applied to the query, with sigmoid as the default; and $w \\in R^{T \\times T}$ are the learned pair-wise position biases.\r\n\r\nExplained in words, for each target position $t$, AFT performs a weighted average of values, the result of which is combined with the query with element-wise multiplication. In particular, the weighting is simply composed of the keys and a set of learned pair-wise position biases. This provides the immediate advantage of not needing to compute and store the expensive attention matrix, while maintaining the global interactions between query and values as MHA does.",
  "title": "An Attention Free Transformer",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "Self-Adversarial Negative Sampling",
  "full_name": "Self-Adversarial Negative Sampling",
  "description": "**Self-Adversarial Negative Sampling** is a negative sampling technique used for methods like [word embeddings](https://paperswithcode.com/methods/category/word-embeddings) and [knowledge graph embeddings](https://paperswithcode.com/methods/category/graph-embeddings). The traditional negative sampling loss from word2vec for optimizing distance-based models be written as:\r\n\r\n$$ L = −\\log\\sigma\\left(\\gamma − d\\_{r}\\left(\\mathbf{h}, \\mathbf{t}\\right)\\right) − \\sum^{n}\\_{i=1}\\frac{1}{k}\\log\\sigma\\left(d\\_{r}\\left(\\mathbf{h}^{'}\\_{i}, \\mathbf{t}^{'}\\_{i}\\right) - \\gamma\\right) $$\r\n\r\nwhere $\\gamma$ is a fixed margin, $\\sigma$ is the sigmoid function, and $\\left(\\mathbf{h}^{'}\\_{i}, r, \\mathbf{t}^{'}\\_{i}\\right)$ is the $i$-th negative triplet. \r\n\r\nThe negative sampling loss samples the negative triplets in a uniform way. Such a uniform negative sampling suffers the problem of inefficiency since many samples are obviously false as training goes on, which does not provide any meaningful information. Therefore, the authors propose an approach called self-adversarial negative sampling, which samples negative triples according to the current embedding model. Specifically, we sample negative triples from the following distribution:\r\n\r\n$$ p\\left(h^{'}\\_{j}, r, t^{'}\\_{j} | \\text{set}\\left(h\\_{i}, r\\_{i}, t\\_{i} \\right) \\right) = \\frac{\\exp\\alpha{f}\\_{r}\\left(\\mathbf{h}^{'}\\_{j}, \\mathbf{t}^{'}\\_{j}\\right)}{\\sum\\_{i=1}\\exp\\alpha{f}\\_{r}\\left(\\mathbf{h}^{'}\\_{i}, \\mathbf{t}^{'}\\_{i}\\right)} $$\r\n\r\nwhere $\\alpha$ is the temperature of sampling. Moreover, since the sampling procedure may be costly, the authors treat the above probability as the weight of the negative sample. 
Therefore, the final negative sampling loss with self-adversarial training takes the following form:\r\n\r\n$$ L = -\\log\\sigma\\left(\\gamma - d\\_{r}\\left(\\mathbf{h}, \\mathbf{t}\\right)\\right) - \\sum^{n}\\_{i=1}p\\left(h^{'}\\_{i}, r, t^{'}\\_{i}\\right)\\log\\sigma\\left(d\\_{r}\\left(\\mathbf{h}^{'}\\_{i}, \\mathbf{t}^{'}\\_{i}\\right) - \\gamma\\right) $$",
  "title": "RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space",
  "collection": "Negative Sampling",
  "area": "General"
}
{
  "name": "SCCL",
  "full_name": "Supporting Clustering with Contrastive Learning",
  "description": "**SCCL**, or **Supporting Clustering with Contrastive Learning**, is a framework to leverage contrastive learning to promote better separation in unsupervised clustering. It combines the top-down clustering with the bottom-up instance-wise contrastive learning to achieve better inter-cluster distance and intra-cluster distance. During training, we jointly optimize a clustering loss over the original data instances and an instance-wise contrastive loss over the associated augmented pairs.",
  "title": "Supporting Clustering with Contrastive Learning",
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "Global Sub-Sampled Attention",
  "full_name": "Global Sub-Sampled Attention",
  "description": "**Global Sub-Sampled Attention**, or **GSA**, is a local [attention mechanism](https://paperswithcode.com/methods/category/attention-mechanisms-1) used in the [Twins-SVT](https://paperswithcode.com/method/twins-svt) architecture. \r\n\r\nA single representative is used to summarize the key information for each of $m \\times n$ subwindows and the representative is used to communicate with other sub-windows (serving as the key in self-attention), which can reduce the cost to $\\mathcal{O}(m n H W d)=\\mathcal{O}\\left(\\frac{H^{2} W^{2} d}{k\\_{1} k\\_{2}}\\right)$. This is essentially equivalent to using the sub-sampled feature maps as the key in attention operations, and thus it is termed global sub-sampled attention (GSA). \r\n\r\nIf we alternatively use the [LSA](https://paperswithcode.com/method/locally-grouped-self-attention) and GSA like [separable convolutions](https://paperswithcode.com/method/depthwise-separable-convolution) (depth-wise + point-wise). The total computation cost is $\\mathcal{O}\\left(\\frac{H^{2} W^{2} d}{k\\_{1} k\\_{2}}+k\\_{1} k\\_{2} H W d\\right) .$ We have:\r\n\r\n$$\\frac{H^{2} W^{2} d}{k\\_{1} k\\_{2}}+k_{1} k_{2} H W d \\geq 2 H W d \\sqrt{H W} $$ \r\n\r\nThe minimum is obtained when $k\\_{1} \\cdot k\\_{2}=\\sqrt{H W}$. Note that $H=W=224$ is popular in classification. Without loss of generality, square sub-windows are used, i.e., $k\\_{1}=k\\_{2}$. Therefore, $k\\_{1}=k\\_{2}=15$ is close to the global minimum for $H=W=224$. However, the network is designed to include several stages with variable resolutions. Stage 1 has feature maps of $56 \\times 56$, the minimum is obtained when $k\\_{1}=k\\_{2}=\\sqrt{56} \\approx 7$. Theoretically, we can calibrate optimal $k\\_{1}$ and $k\\_{2}$ for each of the stages. For simplicity, $k\\_{1}=k\\_{2}=7$ is used everywhere. As for stages with lower resolutions, the summarizing window-size of GSA is controlled to avoid too small amount of generated keys. 
Specifically, sizes of 4, 2, and 1 are used for the last three stages, respectively.",
  "title": "Twins: Revisiting the Design of Spatial Attention in Vision Transformers",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "1D CNN",
  "full_name": "1-Dimensional Convolutional Neural Networks",
  "description": "1D Convolutional Neural Networks are similar to well known and more established 2D Convolutional Neural Networks. 1D Convolutional Neural Networks are used mainly used on text and 1D signals.",
  "title": "Convolutional Neural Network and Rule-Based Algorithms for Classifying 12-lead ECGs",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Inception-A",
  "full_name": "Inception-A",
  "description": "**Inception-A** is an image model block used in the [Inception-v4](https://paperswithcode.com/method/inception-v4) architecture.",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "SAENet",
  "full_name": "Squeeze aggregated excitation network",
  "description": "This method introduces the aggregated dense block within the squeeze excitation block to enhance representation. The squeeze method compresses the input flow and sends it to excitation with dense layers to regain its shape. The paper introduces multiple dense layers stacked side by side, similar to ResNeXt. This learns global representations from the condensed information which enhances the representational power of the network.",
  "title": "Squeeze aggregated excitation network",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "LightConv",
  "full_name": "Lightweight Convolution",
  "description": "**LightConv** is a type of [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution) for sequential modelling which shares certain output channels and whose weights are normalized across the temporal dimension using a [softmax](https://paperswithcode.com/method/softmax). Compared to self-attention, LightConv has a fixed context window and it determines the importance of context elements with a set of weights that do not change over time steps. LightConv computes the following for the $i$-th element in the sequence and output channel $c$:\r\n\r\n$$ \\text{LightConv}\\left(X, W\\_{\\text{ceil}\\left(\\frac{cH}{d}\\right),:}, i, c\\right) = \\text{DepthwiseConv}\\left(X,\\text{softmax}\\left(W\\_{\\text{ceil}\\left(\\frac{cH}{d}\\right),:}\\right), i, c\\right) $$",
  "title": "Pay Less Attention with Lightweight and Dynamic Convolutions",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "GraphESN",
  "full_name": "Graph Echo State Network",
  "description": "**Graph Echo State Network** (**GraphESN**) model is a generalization of the Echo State Network (ESN) approach to graph domains. GraphESNs allow for an efficient approach to Recursive Neural Networks (RecNNs) modeling extended to deal with cyclic/acyclic, directed/undirected, labeled graphs. The recurrent reservoir of the network computes a fixed contractive encoding function over graphs and is left untrained after initialization, while a feed-forward readout implements an adaptive linear output function. Contractivity of the state transition function implies a Markovian characterization of state dynamics and stability of the state computation in presence of cycles. Due to the use of fixed (untrained) encoding, the model represents both an extremely efficient version and a baseline for the performance of recursive models with trained connections.\r\n\r\nDescription from: [Graph Echo State Networks](https://ieeexplore.ieee.org/document/5596796)",
  "title": null,
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Embedding Dropout",
  "full_name": "Embedding Dropout",
  "description": "**Embedding Dropout** is equivalent to performing [dropout](https://paperswithcode.com/method/dropout) on the embedding matrix at a word level, where the dropout is broadcast across all the word vector’s embedding. The remaining non-dropped-out word embeddings are scaled by $\\frac{1}{1-p\\_{e}}$ where $p\\_{e}$ is the probability of embedding dropout. As the dropout occurs on the embedding matrix that is used for a full forward and backward pass, this means that all occurrences of a specific word will disappear within that pass, equivalent to performing [variational dropout](https://paperswithcode.com/method/variational-dropout) on the connection between the one-hot embedding and the embedding lookup.\r\n\r\nSource: Merity et al, Regularizing and Optimizing [LSTM](https://paperswithcode.com/method/lstm) Language Models",
  "title": "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "mLSTM",
  "full_name": "Multiplicative LSTM",
  "description": "A **Multiplicative LSTM (mLSTM)** is a  recurrent neural network architecture for sequence modelling that combines the long short-term memory ([LSTM](https://paperswithcode.com/method/lstm)) and multiplicative recurrent neural network ([mRNN](https://paperswithcode.com/method/mrnn)) architectures. The mRNN and LSTM architectures can be combined by adding connections from the mRNN’s intermediate state $m\\_{t}$ to each gating units in the LSTM.",
  "title": "Multiplicative LSTM for sequence modelling",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Concatenation Affinity",
  "full_name": "Concatenation Affinity",
  "description": "**Concatenation Affinity** is a type of affinity or self-similarity function between two points $\\mathbb{x\\_{i}}$ and $\\mathbb{x\\_{j}}$ that uses a concatenation function:\r\n\r\n$$ f\\left(\\mathbb{x\\_{i}}, \\mathbb{x\\_{j}}\\right) = \\text{ReLU}\\left(\\mathbb{w}^{T}\\_{f}\\left[\\theta\\left(\\mathbb{x}\\_{i}\\right), \\phi\\left(\\mathbb{x}\\_{j}\\right)\\right]\\right)$$\r\n\r\nHere $\\left[·, ·\\right]$ denotes concatenation and $\\mathbb{w}\\_{f}$ is a weight vector that projects the concatenated vector to a scalar.",
  "title": "Non-local Neural Networks",
  "collection": "Affinity Functions",
  "area": "General"
}
{
  "name": "Grid Sensitive",
  "full_name": "Grid Sensitive",
  "description": "**Grid Sensitive** is a trick for object detection introduced by [YOLOv4](https://paperswithcode.com/method/yolov4). When we decode the coordinate of the bounding box center $x$ and $y$, in original [YOLOv3](https://paperswithcode.com/method/yolov3), we can get them by\r\n\r\n$$\r\n\\begin{aligned}\r\n&x=s \\cdot\\left(g\\_{x}+\\sigma\\left(p\\_{x}\\right)\\right) \\\\\r\n&y=s \\cdot\\left(g\\_{y}+\\sigma\\left(p\\_{y}\\right)\\right)\r\n\\end{aligned}\r\n$$\r\n\r\nwhere $\\sigma$ is the sigmoid function, $g\\_{x}$ and $g\\_{y}$ are integers and $s$ is a scale factor. Obviously, $x$ and $y$ cannot be exactly equal to $s \\cdot g\\_{x}$ or $s \\cdot\\left(g\\_{x}+1\\right)$. This makes it difficult to predict the centres of bounding boxes that just located on the grid boundary. We can address this problem, by changing the equation to\r\n\r\n$$\r\n\\begin{aligned}\r\n&x=s \\cdot\\left(g\\_{x}+\\alpha \\cdot \\sigma\\left(p\\_{x}\\right)-(\\alpha-1) / 2\\right) \\\\\r\n&y=s \\cdot\\left(g\\_{y}+\\alpha \\cdot \\sigma\\left(p\\_{y}\\right)-(\\alpha-1) / 2\\right)\r\n\\end{aligned}\r\n$$\r\n\r\nThis makes it easier for the model to predict bounding box center exactly located on the grid boundary. The FLOPs added by Grid Sensitive are really small, and can be totally ignored.",
  "title": "YOLOv4: Optimal Speed and Accuracy of Object Detection",
  "collection": "Object Detection Modules",
  "area": "Computer Vision"
}
{
  "name": "DSelect-k",
  "full_name": "DSelect-k",
  "description": "**DSelect-k** is a continuously differentiable and sparse gate for Mixture-of-experts (MoE), based on a novel binary encoding formulation. Given a user-specified parameter $k$, the gate selects at most $k$ out of the $n$ experts. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. This explicit control over sparsity leads to a cardinality-constrained optimization problem, which is computationally challenging. To circumvent this challenge, the authors use a unconstrained reformulation that is equivalent to the original problem. The reformulated problem uses a binary encoding scheme to implicitly enforce the cardinality constraint. By carefully smoothing the binary encoding variables, the reformulated problem can be effectively optimized using first-order methods such as [SGD](https://paperswithcode.com/method/sgd).\r\n\r\nThe motivation for this method is that  existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods.",
  "title": "DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning",
  "collection": "Mixture-of-Experts",
  "area": "General"
}
{
  "name": "Gravity",
  "full_name": "Gravity",
  "description": "Gravity is a kinematic approach to optimization based on gradients.",
  "title": "Gravity Optimizer: a Kinematic Approach on Optimization in Deep Learning",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "MPN",
  "full_name": "Matrix-power Normalization",
  "description": "",
  "title": "Is Second-order Information Helpful for Large-scale Visual Recognition?",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "MSGAN",
  "full_name": "Multi-source Sentiment Generative Adversarial Network",
  "description": "**Multi-source Sentiment Generative Adversarial Network** is a multi-source domain adaptation (MDA) method for visual sentiment classification. It is composed of three pipelines, i.e., image reconstruction, image translation, and cycle-reconstruction. To handle data from multiple source domains, it learns to find a unified sentiment latent space where data from both the source and target domains share a similar distribution. This is achieved via cycle consistent adversarial learning in an end-to-end manner. Notably, thanks to the unified sentiment latent space, MSGAN requires a single classification network to handle data from different source domains.",
  "title": "Multi-source Domain Adaptation for Visual Sentiment Classification",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "Dense Synthesized Attention",
  "full_name": "Dense Synthesized Attention",
  "description": "**Dense Synthesized Attention**, introduced with the [Synthesizer](https://paperswithcode.com/method/synthesizer) architecture, is a type of synthetic attention mechanism that replaces the notion of [query-key-values](https://paperswithcode.com/method/scaled) in the self-attention module and directly synthesizes the alignment matrix instead. Dense attention is conditioned on each input token. The method accepts an input $X \\in \\mathbb{R}^{l\\text{ x }d}$ and produces an output of $Y \\in \\mathbb{R}^{l\\text{ x }d}$. Here $l$ refers to the sequence length and $d$ refers to the dimensionality of the model. We first adopt $F\\left(.\\right)$, a parameterized function, for projecting input $X\\_{i}$ from $d$ dimensions to $l$ dimensions.\r\n\r\n$$B\\_{i} = F\\left(X\\_{i}\\right)$$\r\n\r\nwhere $F\\left(.\\right)$ is a parameterized function that maps $\\mathbb{R}^{d}$ to $\\mathbb{R}^{l}$ and $i$ is the $i$-th token of $X$. Intuitively, this can be interpreted as learning a token-wise projection to the sequence length $l$. Essentially, with this model, each token predicts weights for each token in the input sequence. In practice, a simple two layered feed-forward layer with [ReLU](https://paperswithcode.com/method/relu) activations for $F\\left(.\\right)$ is adopted:\r\n\r\n$$ F\\left(X\\right) = W\\left(\\sigma\\_{R}\\left(W(X) + b\\right)\\right) + b$$\r\n\r\nwhere $\\sigma\\_{R}$ is the ReLU activation function. Hence, $B$ is now of $\\mathbb{R}^{l\\text{ x }d}$. Given $B$, we now compute:\r\n\r\n$$ Y = \\text{Softmax}\\left(B\\right)G\\left(X\\right) $$\r\n\r\nwhere $G\\left(.\\right)$ is another parameterized function of $X$ that is analogous to $V$ (value) in the standard [Transformer](https://paperswithcode.com/method/transformer) model. This approach eliminates the [dot product](https://paperswithcode.com/method/scaled) altogether by replacing $QK^{T}$ in standard Transformers with the synthesizing function $F\\left(.\\right)$.",
  "title": "Synthesizer: Rethinking Self-Attention in Transformer Models",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "CondConv",
  "full_name": "CondConv",
  "description": "**CondConv**, or **Conditionally Parameterized Convolutions**, are a type of [convolution](https://paperswithcode.com/method/convolution) which learn specialized convolutional kernels for each example. In particular, we parameterize the convolutional kernels in a CondConv layer as a linear combination of $n$ experts $(\\alpha_1 W_1 + \\ldots + \\alpha_n W_n) * x$, where $\\alpha_1, \\ldots, \\alpha_n$ are functions of the input learned through gradient descent. To efficiently increase the capacity of a CondConv layer, developers can increase the number of experts. This can be more computationally efficient than increasing the size of the convolutional kernel itself, because the convolutional kernel is applied at many different positions within the input, while the experts are combined only once per input.",
  "title": "CondConv: Conditionally Parameterized Convolutions for Efficient Inference",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "DDPG",
  "full_name": "Deep Deterministic Policy Gradient",
  "description": "**DDPG**, or **Deep Deterministic Policy Gradient**, is an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. It combines the actor-critic approach with insights from [DQNs](https://paperswithcode.com/method/dqn): in particular, the insights that 1) the network is trained off-policy with samples from a replay buffer to minimize correlations between samples, and 2) the network is trained with a target Q network to give consistent targets during temporal difference backups. DDPG makes use of the same ideas along with [batch normalization](https://paperswithcode.com/method/batch-normalization).",
  "title": "Continuous control with deep reinforcement learning",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "CT-Layer",
  "full_name": "Commute Times Layer",
  "description": "**TL;DR: CT-Layer is a GNN Layer which is able to rewire a graph in an inductive an parameter-free way according to the commute times distance (or effective resistance). We address it learning a differentiable way to compute the CT-embedding of the graph.**\r\n\r\n### Summary\r\n\r\n**CT-Layer** is able to Learn the *Commute Times distance*  between nodes (i.e. *effective resistance distance*) in a **differentiable** way, instead of the common spectral version, and in a **parameter free** manner, which is not the cased of the heat kernel. This approach allow to solve it as an optimization problem inside a GNN, leading to have a new layer which is able to learn how rewire a given graph in an optimal, and **inductive** way. \r\n\r\nIn addition, **CT-Layer** also is able to learn *Commute Times embeddings*, and then calculate it for any graph in an inductive way. The Commute Times embedding is also related with the *eigenvalues* and *eigenvectors* of the Laplacian of the graph, because CT embedding is just the eigenvectors scaled. Therefore, CT-Layer is also able to learn hot to calculate the spectrum of the Laplacian in a differentiable way. Therefore, this embedding must satisfy orthogonality and normality.\r\n\r\nFinally, recent connections has been found between commute times distance and **curvature** (which is non-differentiable), establishing equivalences between them. Therefore, **CT-Layer** can also be seen as the differentiable version of the curvature rewiring.\r\n\r\n**We are going through a quick overview of the layer, but I suggest go to the paper for a detailed explanation. **\r\n\r\n### Spectral CT- Embedding downsides\r\nCT-embedding $\\mathbf{Z}$ is computed spectrally  in the literature (until the proposal of this method) or it is approximated using the heat kernel (very dependent on hyperparameter $t$). 
This fact does not allow us to propose differentiable methods using that measure:\r\n$$\r\n\\mathbf{Z}=\\sqrt{vol(G)}\\mathbf{\\Lambda}^\\frac{1}{2}\\mathbf{F}^T \\textrm{ given } \\mathbf{L}=\\mathbf{F}\\mathbf{\\Lambda}\\mathbf{F}^T\r\n$$\r\n\r\nThen, the CT-distance is given by the Euclidean distances between the embeddings $CT_{uv} = ||\\mathbf{z_u}-\\mathbf{z_v}||^2$. The spectral form is: \r\n\r\n$$\r\n\\frac{CT_{uv}}{vol(G)} = \\sum_{i=2}^n \\frac{1}{\\lambda_i} (\\mathbf{f}(u)-\\mathbf{f}(v))^2 \r\n$$\r\nwhere $\\mathbf{f}$ denotes the eigenvectors of the graph Laplacian. \r\n\r\nThis embedding and these distances capture desirable properties of the graph, such as an understanding of its structure, or an embedding based on the spectrum which minimizes Dirichlet energies. However, **the spectral computation is not differentiable**.\r\n\r\n### CT-Layer as an optimization problem: Differentiable, learnable and inductive CT-Layer\r\nGiven that $\\mathbf{Z}$ minimizes Dirichlet energies subject to being orthogonal and normalized, we can formulate this problem as constraining neighboring nodes to have similar embeddings s.t. 
$\\mathbf{Z}^T\\mathbf{Z}=\\mathbf{I}$.\r\n\r\n$$\r\n\\mathbf{Z} = \\arg\\min_{\\mathbf{Z}^T\\mathbf{Z}=\\mathbf{I}} \\frac{\\sum\\_{u,v} ||\\mathbf{z_u}-\\mathbf{z_v}||^2\\mathbf{A}\\_{uv}}{\\sum\\_{u,v} \\mathbf{Z}^2\\_{uv} d_u}=\\frac{Tr[\\mathbf{Z}^T\\mathbf{L}\\mathbf{Z}]}{Tr[\\mathbf{Z}^T\\mathbf{D}\\mathbf{Z}]}\r\n$$\r\n\r\nWith the above elements we have a definition of **CT-Layer**, our rewiring layer: \r\nGiven the matrix $\\mathbf{X}\\_{n\\times F}$ encoding the features of the nodes after any message passing (MP) layer, $\\mathbf{Z}\\_{n\\times O(n)}=\\tanh(\\textrm{MLP}(\\mathbf{X}))$ learns the association $\\mathbf{X}\\rightarrow \\mathbf{Z}$ while $\\mathbf{Z}$ is optimized according to the loss \r\n$$\r\nL\\_{CT} = \\frac{Tr[\\mathbf{Z}^T\\mathbf{L}\\mathbf{Z}]}{Tr[\\mathbf{Z}^T\\mathbf{D}\\mathbf{Z}]} + \\left\\|\\frac{\\mathbf{Z}^T\\mathbf{Z}}{\\|\\mathbf{Z}^T\\mathbf{Z}\\|\\_F} - \\mathbf{I}\\_n\\right\\|\\_F\r\n$$\r\n This results in the following **resistance diffusion** $\\mathbf{T}^{CT} = \\mathbf{R}(\\mathbf{S})\\odot \\mathbf{A}$ (Hadamard product between the resistance distance and the adjacency), which provides a learnt convolution matrix as input to the subsequent MP layer.\r\n\r\nAs explained before, $\\mathbf{Z}$ is the **commute times embedding matrix** and the pairwise Euclidean distances between the learned embeddings are the **commute times distances** or resistance distances. **Therefore, once trained, this layer will be able to calculate the commute times embedding for a new graph, and rewire that new and unseen graph in a principled way based on the commute times distance.**\r\n\r\n## Preservation of Structure\r\nDoes this rewiring preserve the original structure? Let $G' = \\textrm{Sparsify}(G, q)$ be a sampling algorithm of graph $G = (V, E)$, where edges $e \\in E$ are sampled with probability $q\\propto R_e$ (**proportional to the effective resistance, i.e. 
commute times**).\r\nThen, for $n = |V|$ sufficiently large and $1/\\sqrt{n}< \\epsilon\\le 1$, we need $O(n\\log n/\\epsilon^2)$ samples to satisfy:\r\n\r\n$$\r\n\\forall \\mathbf{x}\\in\\mathbb{R}^n:\\; (1-\\epsilon)\\mathbf{x}^T\\mathbf{L}\\_G\\mathbf{x}\\le\\mathbf{x}^T\\mathbf{L}\\_{G'}\\mathbf{x}\\le (1+\\epsilon)\\mathbf{x}^T\\mathbf{L}\\_G\\mathbf{x}\r\n$$\r\n\r\nThe intuition is that Dirichlet energies in $G'$ are bounded within $(1\\pm \\epsilon)$ of the Dirichlet energies of the original graph $G$.",
  "title": "DiffWire: Inductive Graph Rewiring via the Lovász Bound",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Syntax Heat Parse Tree",
  "full_name": "Syntax Heat Parse Tree",
  "description": "Syntax Heat Parse Tree are heatmaps over parse trees, similar to [\"heat trees\"](https://doi.org/10.1371/journal.pcbi.1005404) in biology.",
  "title": "KERMIT: Complementing Transformer Architectures with Encoders of Explicit Syntactic Interpretations",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "Blended Diffusion",
  "full_name": "Blended Diffusion",
  "description": "Blended Diffusion enables a zero-shot local text-guided image editing of natural images.\r\nGiven an input image $x$, an input mask $m$ and a target guiding text $t$ - the method enables to change the masked area within the image corresponding the the guiding text s.t. the unmasked area is left unchanged.",
  "title": "Blended Diffusion for Text-driven Editing of Natural Images",
  "collection": "Image Generation Models",
  "area": "Computer Vision"
}
{
  "name": "Variational Inference",
  "full_name": "Variational Inference",
  "description": "",
  "title": "Autoencoding Variational Autoencoder",
  "collection": "Dimensionality Reduction",
  "area": "General"
}
{
  "name": "Spatially Separable Convolution",
  "full_name": "Spatially Separable Convolution",
  "description": "A **Spatially Separable Convolution** decomposes a [convolution](https://paperswithcode.com/method/convolution) into two separate operations. In regular convolution, if we have a 3 x 3 kernel then we directly convolve this with the image. We can divide a 3 x 3 kernel into a 3 x 1 kernel and a 1 x 3 kernel. Then, in spatially separable convolution, we first convolve the 3 x 1 kernel then the 1 x 3 kernel. This requires 6 instead of 9 parameters compared to regular convolution, and so it is more parameter efficient (additionally less matrix multiplications are required).\r\n\r\nImage Source: [Kunlun Bai](https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215)",
  "title": null,
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "AGCN",
  "full_name": "Adaptive Graph Convolutional Neural Networks",
  "description": "AGCN is a novel spectral graph convolution network that feed on original data of diverse graph structures.\r\n\r\nImage credit: [Adaptive Graph Convolutional Neural Networks](https://arxiv.org/pdf/1801.03226.pdf)",
  "title": "Adaptive Graph Convolutional Neural Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Fast Minimum-Norm Attack",
  "full_name": "Fast Minimum-Norm Attack",
  "description": "**Fast Minimum-Norm Attack**, or **FNM**, is a type of adversarial attack that works with different $\\ell_{p}$-norm perturbation models ($p=0,1,2,\\infty$), is robust to hyperparameter choices, does not require adversarial starting points, and converges within few lightweight steps. It works by iteratively finding the sample misclassified with maximum confidence within an $\\ell_{p}$-norm constraint of size $\\epsilon$, while adapting $\\epsilon$ to minimize the distance of the current sample to the decision boundary.",
  "title": "Fast Minimum-norm Adversarial Attacks through Adaptive Norm Constraints",
  "collection": "Adversarial Attacks",
  "area": "General"
}
{
  "name": "Mesh-TensorFlow",
  "full_name": "Mesh-TensorFlow",
  "description": "**Mesh-TensorFlow** is a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the \"batch\" dimension, in Mesh-TensorFlow, the user can specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A MeshTensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce.",
  "title": "Mesh-TensorFlow: Deep Learning for Supercomputers",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "Soft Split and Soft Composition",
  "full_name": "Soft Split and Soft Composition",
  "description": "**Soft Split and Soft Composition** are video frame based operations used in the [FuseFormer](https://paperswithcode.com/method/fuseformer) architecture, specifically the [FuseFormer blocks](https://paperswithcode.com/method/fuseformer-block). We softly split each frame into overlapped patches and then softly composite them back, by using an unfold and fold operator with patch size $k$ being greater than patch stride $s$. When compositing patches back to its original spatial shape, we add up feature values at each overlapping spatial location of neighboring patches.",
  "title": "FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting",
  "collection": "Video Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "ResNeXt-Elastic",
  "full_name": "ResNeXt-Elastic",
  "description": "**ResNeXt-Elastic** is a convolutional neural network that is a modification of a [ResNeXt](https://paperswithcode.com/method/resnext) with elastic blocks (extra upsampling and downsampling).",
  "title": "ELASTIC: Improving CNNs with Dynamic Scaling Policies",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Fixed Factorized Attention",
  "full_name": "Fixed Factorized Attention",
  "description": "**Fixed Factorized Attention** is a factorized attention pattern where specific cells summarize previous locations and propagate that information to all future cells. It was proposed as part of the [Sparse Transformer](https://paperswithcode.com/method/sparse-transformer) architecture.\r\n\r\n\r\nA self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \\text{set}\\left(S\\_{1}, \\dots, S\\_{n}\\right)$, where $S\\_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:\r\n\r\n$$ \\text{Attend}\\left(X, S\\right) = \\left(a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right)\\right)\\_{i\\in\\text{set}\\left(1,\\dots,n\\right)}$$\r\n\r\n$$ a\\left(\\mathbf{x}\\_{i}, S\\_{i}\\right) = \\text{softmax}\\left(\\frac{\\left(W\\_{q}\\mathbf{x}\\_{i}\\right)K^{T}\\_{S\\_{i}}}{\\sqrt{d}}\\right)V\\_{S\\_{i}} $$\r\n\r\n$$ K\\_{Si} = \\left(W\\_{k}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\n$$ V\\_{Si} = \\left(W\\_{v}\\mathbf{x}\\_{j}\\right)\\_{j\\in{S\\_{i}}} $$\r\n\r\nHere $W\\_{q}$, $W\\_{k}$, and $W\\_{v}$ represent the weight matrices which transform a given $x\\_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.\r\n\r\nFull self-attention for autoregressive models defines $S\\_{i} = \\text{set}\\left(j : j \\leq i\\right)$, allowing every element to attend to all previous positions and its own position.\r\n\r\nFactorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A\\_{i}^{(m)} ⊂ \\text{set}\\left(j : j \\leq i\\right)$ and lets $S\\_{i} = A\\_{i}^{(m)}$. 
The goal with the Sparse [Transformer](https://paperswithcode.com/method/transformer) was to find efficient choices for the subset $A$.\r\n\r\nFormally for Fixed Factorized Attention, $A^{(1)}\\_{i} = ${$j : \\left(\\lfloor{j/l\\rfloor}=\\lfloor{i/l\\rfloor}\\right)$}, where the brackets denote the floor operation, and $A^{(2)}\\_{i} = ${$j : j \\mod l \\in ${$t, t+1, \\ldots, l$}}, where $t=l-c$ and $c$ is a hyperparameter. The $i$-th output vector of the attention head attends to all input vectors either from $A^{(1)}\\_{i}$ or $A^{(2)}\\_{i}$. This pattern can be visualized in the figure to the right.\r\n\r\nIf the stride is 128 and $c = 8$, then all future positions greater than 128 can attend to positions 120-128, all positions greater than 256 can attend to 248-256, and so forth. \r\n\r\nA fixed-attention pattern with $c = 1$ limits the expressivity of the network significantly, as many representations in the network are only used for one block whereas a small number of locations are used by all blocks. The authors found choosing $c \\in ${$8, 16, 32$} for typical values of $l \\in ${$128, 256$} performs well, although this increases the computational cost of this method by $c$ in comparison to the [strided attention](https://paperswithcode.com/method/strided-attention).\r\n\r\nAdditionally, the authors found that when using multiple heads, having them attend to distinct subblocks of length $c$ within the block of size $l$ was preferable to having them attend to the same subblock.",
  "title": "Generating Long Sequences with Sparse Transformers",
  "collection": "Attention Patterns",
  "area": "Natural Language Processing"
}
{
  "name": "Minibatch Discrimination",
  "full_name": "Minibatch Discrimination",
  "description": "**Minibatch Discrimination** is a discriminative technique for generative adversarial networks where we discriminate between whole minibatches of samples rather than between individual samples. This is intended to avoid collapse of the generator.",
  "title": "Improved Techniques for Training GANs",
  "collection": "Generative Discrimination",
  "area": "Computer Vision"
}
{
  "name": "GloVe",
  "full_name": "GloVe Embeddings",
  "description": "**GloVe Embeddings** are a type of word embedding that encode the co-occurrence probability ratio between two words as vector differences. GloVe uses a weighted least squares objective $J$ that minimizes the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:\r\n\r\n$$ J=\\sum\\_{i, j=1}^{V}f\\left(𝑋\\_{i j}\\right)(w^{T}\\_{i}\\tilde{w}_{j} + b\\_{i} + \\tilde{b}\\_{j} - \\log{𝑋}\\_{ij})^{2} $$\r\n\r\nwhere $w\\_{i}$ and $b\\_{i}$ are the word vector and bias respectively of word $i$, $\\tilde{w}_{j}$ and $b\\_{j}$ are the context word vector and bias respectively of word $j$, $X\\_{ij}$ is the number of times word $i$ occurs in the context of word $j$, and $f$ is a weighting function that assigns lower weights to rare and frequent co-occurrences.",
  "title": "GloVe: Global Vectors for Word Representation",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "COCO-FUNIT",
  "full_name": "COCO-FUNIT",
  "description": "**COCO-FUNIT** is few-shot image translation model which computes the style embedding of the example images conditioned on the input image and a new module called the constant style bias. It builds on top of [FUNIT](https://arxiv.org/abs/1905.01723) by identifying the content loss problem and then addressing it with a novel content-conditioned style encoder architecture.\r\n\r\nThe FUNIT method suffers from the content loss problem—the translation result is not well-aligned with the input image. While a direct theoretical analysis is likely elusive, we conduct an empirical study, aiming at identify the cause of the content loss problem. In analyses, the authors show that the FUNIT style encoder produces very different style codes using different crops -- suggesting the style code contains other information about the style image such as the object pose.\r\n\r\nTo make the style embedding more robust to small variations in the style image, a new style encoder architecture, the Content-Conditioned style encoder (COCO), is introduced. The most distinctive feature of this new encoder is the conditioning in the content image as illustrated in the top-right of the Figure. Unlike the style encoder in FUNIT, COCO takes both content and style image as input. With this content-conditioning scheme, a direct feedback path is created during learning to let the content image influence how the style code is computed. It also helps reduce the direct influence of the style image to the extract style code.",
  "title": "COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder",
  "collection": "Unpaired Image-to-Image Translation",
  "area": "Computer Vision"
}
{
  "name": "CARLA",
  "full_name": "CARLA: An Open Urban Driving Simulator",
  "description": "CARLA is an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. \r\n\r\nSource: [Dosovitskiy et al.](https://arxiv.org/pdf/1711.03938v1.pdf)\r\n\r\nImage source: [Dosovitskiy et al.](https://arxiv.org/pdf/1711.03938v1.pdf)",
  "title": "CARLA: An Open Urban Driving Simulator",
  "collection": "Video Game Models",
  "area": "Reinforcement Learning"
}
{
  "name": "Softplus",
  "full_name": "Softplus",
  "description": "**Softplus** is an activation function $f\\left(x\\right) = \\log\\left(1+\\exp\\left(x\\right)\\right)$. It can be viewed as a smooth version of [ReLU](https://paperswithcode.com/method/relu).",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "CBNet",
  "full_name": "Composite Backbone Network",
  "description": "**CBNet** is a backbone architecture that consists of multiple identical backbones (specially called Assistant Backbones and Lead Backbone) and composite connections between neighbor backbones. From left to right, the output of each stage in an Assistant Backbone, namely higher-level\r\nfeatures, flows to the parallel stage of the succeeding backbone as part of inputs through composite connections. Finally, the feature maps of the last backbone named Lead\r\nBackbone are used for object detection. The features extracted by CBNet for object detection fuse the high-level and low-level features of multiple backbones, hence improve the detection performance.",
  "title": "CBNet: A Novel Composite Backbone Network Architecture for Object Detection",
  "collection": "Backbone Architectures",
  "area": "Computer Vision"
}
{
  "name": "SSTDA",
  "full_name": "Self-Supervised Temporal Domain Adaptation",
  "description": "**Self-Supervised Temporal Domain Adaptation (SSTDA)** is a method for action segmentation with self-supervised temporal domain adaptation. It contains two self-supervised auxiliary tasks (binary and sequential domain prediction) to jointly align cross-domain feature spaces embedded with local and global temporal dynamics.",
  "title": "Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "Deep Boltzmann Machine",
  "full_name": "Deep Boltzmann Machine",
  "description": "A **Deep Boltzmann Machine (DBM)** is a three-layer generative model. It is similar to a [Deep Belief Network](https://paperswithcode.com/method/deep-belief-network), but instead allows bidirectional connections in the bottom layers. Its energy function is  as an extension of the energy function of the RBM:\r\n\r\n$$ E\\left(v, h\\right) = -\\sum^{i}\\_{i}v\\_{i}b\\_{i} - \\sum^{N}\\_{n=1}\\sum_{k}h\\_{n,k}b\\_{n,k}-\\sum\\_{i, k}v\\_{i}w\\_{ik}h\\_{k} - \\sum^{N-1}\\_{n=1}\\sum\\_{k,l}h\\_{n,k}w\\_{n, k, l}h\\_{n+1, l}$$\r\n\r\nfor a DBM with $N$ hidden layers.\r\n\r\nSource: [On the Origin of Deep Learning](https://arxiv.org/pdf/1702.07800.pdf)",
  "title": null,
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "CoaT",
  "full_name": "Co-Scale Conv-attentional Image Transformer",
  "description": "**Co-Scale Conv-Attentional Image Transformer** (CoaT) is a [Transformer](https://paperswithcode.com/method/transformer)-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other. Second, the conv-attentional mechanism is designed by realizing a relative position embedding formulation in the factorized attention module with an efficient [convolution](https://paperswithcode.com/method/convolution)-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities.",
  "title": "Co-Scale Conv-Attentional Image Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "SRM",
  "full_name": "style-based recalibration module",
  "description": "SRM combines style transfer with an attention mechanism. Its main contribution is style pooling which utilizes both mean and standard deviation of the input features to improve its capability to capture global information. It also adopts a lightweight channel-wise fully-connected (CFC) layer, in place of the original fully-connected layer, to reduce the computational requirements.\r\nGiven an input feature map $X \\in \\mathbb{R}^{C \\times H \\times W}$, SRM first collects global information by using style pooling ($\\text{SP}(\\cdot)$) which combines global average pooling and global standard deviation pooling. \r\nThen a channel-wise fully connected ($\\text{CFC}(\\cdot)$) layer (i.e. fully connected per channel), batch normalization $\\text{BN}$ and sigmoid function $\\sigma$ are used  to provide the attention vector. Finally,   as in an SE block, the input features are multiplied by the attention vector. Overall, an SRM can be written as:\r\n\\begin{align}\r\n    s = F_\\text{srm}(X, \\theta) & = \\sigma (\\text{BN}(\\text{CFC}(\\text{SP}(X))))\r\n\\end{align}\r\n\\begin{align}\r\n    Y & = s  X\r\n\\end{align}\r\nThe SRM block improves both squeeze and excitation modules, yet can be added after each residual unit like an SE block.",
  "title": "SRM: A Style-Based Recalibration Module for Convolutional Neural Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Object Dropout",
  "full_name": "Object Dropout",
  "description": "Object Dropout is a technique that perturbs object features in an image for [noisy student](https://paperswithcode.com/method/noisy-student) training. It performs at par with standard data augmentation techniques while being significantly faster than the latter to implement.",
  "title": "Perturb, Predict & Paraphrase: Semi-Supervised Learning using Noisy Student for Image Captioning",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Linear Regression",
  "full_name": "Linear Regression",
  "description": "**Linear Regression** is a method for modelling a relationship between a dependent variable and independent variables. These models can be fit with numerous approaches. The most common is *least squares*, where we minimize the mean square error between the predicted values $\\hat{y} = \\textbf{X}\\hat{\\beta}$ and actual values $y$: $\\left(y-\\textbf{X}\\beta\\right)^{2}$.\r\n\r\nWe can also define the problem in probabilistic terms as a generalized linear model (GLM) where the pdf is a Gaussian distribution, and then perform maximum likelihood estimation to estimate $\\hat{\\beta}$.\r\n\r\nImage Source: [Wikipedia](https://en.wikipedia.org/wiki/Linear_regression)",
  "title": null,
  "collection": "Generalized Linear Models",
  "area": "General"
}
{
  "name": "SASA",
  "full_name": "Stand-Alone Self Attention",
  "description": "**Stand-Alone Self Attention** (SASA) replaces all instances of spatial [convolution](https://paperswithcode.com/method/convolution) with a form of self-attention applied to [ResNet](https://paperswithcode.com/method/resnet) producing a fully, stand-alone self-attentional model.",
  "title": "Stand-Alone Self-Attention in Vision Models",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "SCA",
  "full_name": "Semantic Cross Attention",
  "description": "Semantic Cross Attention (SCA) is based on cross attention, which we restrict with respect to a semantic mask.\r\n\r\nThe goal of SCA is two-fold depending on what is the query and what is the key. Either it allows to give the feature map information from a semantically restricted set of latents or, respectively, it allows a set of latents to retrieve information in a semantically restricted region of the feature map. \r\n\r\nSCA is defined as:  \r\n\r\n\\begin{equation}\r\n    \\text{SCA}(I_{1}, I_{2}, I_{3}) = \\sigma\\left(\\frac{QK^T\\odot I_{3} +\\tau \\left(1-I_{3}\\right)}{\\sqrt{d_{in}}}\\right)V \\quad ,\r\n\\end{equation}\r\n\r\nwhere $I_{1},I_{2},I_{3}$ the inputs, with $I_{1}$ attending $I_{2}$, and $I_{3}$ the mask that forces tokens from $I_1$ to attend only specific tokens from $I_2$. The attention values requiring masking are filled with $-\\infty$ before the softmax. (In practice $\\tau{=}-10^9$),  $Q {=} W_QI_{1}$, $K {=} W_KI_{2}$ and $V {=} W_VI_{2}$ the queries, keys and values, and $d_{in}$ the internal attention dimension. $\\sigma(.)$ is the softmax operation.\r\n\r\nLet $X\\in\\mathbb{R}^{n\\times C}$ be the feature map with n the number of pixels, and C the number of channels. Let $Z\\in\\mathbb{R}^{m\\times d}$ be a set of $m$ latents of dimension $d$ and $s$ the number of semantic labels. Each semantic label is attributed $k$ latents, such that $m=k\\times s$. Each semantic label mask is assigned $k$ copies in $S{\\in}\\{0;1\\}^{n \\times m}$. \r\n\r\nWe can differentiate 3 types of SCA:\r\n\r\n(a) SCA with pixels $X$ attending latents $Z$: $\\text{SCA}(X, Z, S)$, where $W_{Q} {\\in} \\mathbb{R}^{n\\times d_{in}}$ and $W_{K}, W_{V} {\\in} \\mathbb{R}^{m\\times d_{in}}$.\r\nThe idea is to force the pixels from a semantic region to attend latents that are associated with the same label. 
\r\n\r\n(b) SCA with latents $Z$ attending pixels $X$: $\\text{SCA}(Z, X, S)$, where $W_{Q}{\\in} \\mathbb{R}^{m\\times d_{in}}$, $W_{K}, W_{V} {\\in} \\mathbb{R}^{n\\times d_{in}}$. \r\nThe idea is to semantically mask attention values to enforce latents to attend semantically corresponding pixels.\r\n\r\n(c) SCA with latents $Z$ attending themselves: $\\text{SCA}(Z, Z, M)$, where $W_{Q}, W_{K}, W_{V} {\\in} \\mathbb{R}^{n\\times d_{in}}$. We denote $M\\in\\mathbb{N}^{m\\times m}$ this mask, with $M_{\\text{latents}}(i,j) {=} 1$ if the semantic label of latent $i$ is the same as the one of latent $j$; $0$ otherwise.\r\nThe idea is to let the latents only attend latents that share the same semantic label.",
  "title": "SCAM! Transferring humans between images with Semantic Cross Attention Modulation",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "GraRep",
  "full_name": "Graph Representation with Global structure",
  "description": "",
  "title": "GraRep: Learning Graph Representations with Global Structural Information",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "ResNeXt Block",
  "full_name": "ResNeXt Block",
  "description": "A **ResNeXt Block** is a type of [residual block](https://paperswithcode.com/method/residual-block) used as part of the [ResNeXt](https://paperswithcode.com/method/resnext) CNN architecture. It uses a \"split-transform-merge\" strategy (branched paths within a single module) similar to an [Inception module](https://paperswithcode.com/method/inception-module), i.e. it aggregates a set of transformations. Compared to a Residual Block, it exposes a new dimension,  *cardinality* (size of set of transformations) $C$, as an essential factor in addition to depth and width. \r\n\r\nFormally, a set of aggregated transformations can be represented as: $\\mathcal{F}(x)=\\sum_{i=1}^{C}\\mathcal{T}_i(x)$, where $\\mathcal{T}_i(x)$ can be an arbitrary function. Analogous to a simple neuron, $\\mathcal{T}_i$ should project $x$ into an (optionally low-dimensional) embedding and then transform it.",
  "title": "Aggregated Residual Transformations for Deep Neural Networks",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "Grouped-query attention",
  "full_name": "Grouped-query attention",
  "description": "**Grouped-query attention** an interpolation of multi-query and multi-head attention that achieves quality close to multi-head at comparable speed to multi-query attention.",
  "title": "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints",
  "collection": "Attention",
  "area": "General"
}
{
  "name": "SLAMB",
  "full_name": "Sparse Layer-wise Adaptive Moments optimizer for large Batch training",
  "description": "Please enter a description about the method here",
  "title": "SLAMB: Accelerated Large Batch Training with Sparse Communication",
  "collection": "Large Batch Optimization",
  "area": "General"
}
{
  "name": "Polynomial Rate Decay",
  "full_name": "Polynomial Rate Decay",
  "description": "**Polynomial Rate Decay** is a learning rate schedule where we polynomially decay the learning rate.",
  "title": null,
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "Scale Aggregation Block",
  "full_name": "Scale Aggregation Block",
  "description": "A **Scale Aggregation Block** concatenates feature maps at a wide range of scales. Feature maps for each scale are generated by a stack of downsampling, [convolution](https://paperswithcode.com/method/convolution) and upsampling operations. The proposed scale aggregation block is a standard computational module which readily replaces any given transformation $\\mathbf{Y}=\\mathbf{T}(\\mathbf{X})$, where $\\mathbf{X}\\in \\mathbb{R}^{H\\times W\\times C}$, $\\mathbf{Y}\\in \\mathbb{R}^{H\\times W\\times C_o}$ with $C$ and $C_o$ being the input and output channel number respectively. $\\mathbf{T}$ is any operator such as a convolution layer or a series of convolution layers. Assume we have $L$ scales. Each scale $l$ is generated by sequentially conducting a downsampling $\\mathbf{D}_l$, a transformation $\\mathbf{T}_l$ and an unsampling operator $\\mathbf{U}_l$:\r\n\r\n$$\r\n\\mathbf{X}^{'}_l=\\mathbf{D}_l(\\mathbf{X}),\r\n\\label{eq:eq_d}\r\n$$\r\n\r\n$$\r\n\\mathbf{Y}^{'}_l=\\mathbf{T}_l(\\mathbf{X}^{'}_l),\r\n\\label{eq:eq_tl}\r\n$$\r\n\r\n$$\r\n\\mathbf{Y}_l=\\mathbf{U}_l(\\mathbf{Y}^{'}_l),\r\n\\label{eq:eq_u}\r\n$$\r\n\r\nwhere $\\mathbf{X}^{'}_l\\in \\mathbb{R}^{H_l\\times W_l\\times C}$,\r\n$\\mathbf{Y}^{'}_l\\in \\mathbb{R}^{H_l\\times W_l\\times C_l}$, and\r\n$\\mathbf{Y}_l\\in \\mathbb{R}^{H\\times W\\times C_l}$.\r\nNotably, $\\mathbf{T}_l$ has the similar structure as $\\mathbf{T}$.\r\nWe can concatenate all $L$ scales together, getting\r\n\r\n$$\r\n\\mathbf{Y}^{'}=\\Vert^L_1\\mathbf{U}_l(\\mathbf{T}_l(\\mathbf{D}_l(\\mathbf{X}))),\r\n\\label{eq:eq_all}\r\n$$\r\n\r\nwhere $\\Vert$ indicates concatenating feature maps along the channel dimension, and $\\mathbf{Y}^{'} \\in \\mathbb{R}^{H\\times W\\times \\sum^L_1 C_l}$ is the final output feature maps of the scale aggregation block.\r\n\r\nIn the reference implementation, the downsampling $\\mathbf{D}_l$ with factor $s$ is implemented by a max pool layer with $s\\times s$ kernel size and  
$s$ stride. The upsampling $\\mathbf{U}_l$ is implemented by resizing with the nearest neighbor  interpolation.",
  "title": "Data-Driven Neuron Allocation for Scale Aggregation Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Segregated Attention Network",
  "full_name": "Segregated Attention Network",
  "description": "",
  "title": "InferNER: an attentive model leveraging the sentence-level information for Named Entity Recognition in Microblogs",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Targeted Dropout",
  "full_name": "Targeted Dropout",
  "description": "Please enter a description about the method here",
  "title": "Learning Sparse Networks Using Targeted Dropout",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "AggMo",
  "full_name": "AggMo",
  "description": "**Aggregated Momentum (AggMo)** is a variant of the [classical momentum](https://paperswithcode.com/method/sgd-with-momentum) stochastic optimizer which maintains several velocity vectors with different $\\beta$ parameters. AggMo averages the velocity vectors when updating the parameters. It resolves the problem of choosing a momentum parameter by taking a linear combination of multiple momentum buffers. Each of $K$ momentum buffers have a different discount factor $\\beta \\in \\mathbb{R}^{K}$, and these are averaged for the update. The update rule is:\r\n\r\n$$ \\textbf{v}\\_{t}^{\\left(i\\right)} = \\beta^{(i)}\\textbf{v}\\_{t-1}^{\\left(i\\right)} - \\nabla\\_{\\theta}f\\left(\\mathbf{\\theta}\\_{t-1}\\right) $$\r\n\r\n$$ \\mathbf{\\theta\\_{t}} = \\mathbf{\\theta\\_{t-1}} + \\frac{\\gamma\\_{t}}{K}\\sum^{K}\\_{i=1}\\textbf{v}\\_{t}^{\\left(i\\right)} $$\r\n\r\nwhere $v^{\\left(i\\right)}_{0}$ for each $i$. The vector $\\mathcal{\\beta} = \\left[\\beta^{(1)}, \\ldots, \\beta^{(K)}\\right]$ is the dampening factor.",
  "title": "Aggregated Momentum: Stability Through Passive Damping",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "True Online TD Lambda",
  "full_name": "True Online TD Lambda",
  "description": "**True Online $TD\\left(\\lambda\\right)$** seeks to approximate the ideal online $\\lambda$-return algorithm. It seeks to invert this ideal forward-view algorithm to produce an efficient backward-view algorithm using eligibility traces. It uses dutch traces rather than accumulating traces.\r\n\r\nSource: [Sutton and Seijen](http://proceedings.mlr.press/v32/seijen14.pdf)",
  "title": null,
  "collection": "On-Policy TD Control",
  "area": "Reinforcement Learning"
}
{
  "name": "DualCL",
  "full_name": "Dual Contrastive Learning",
  "description": "Contrastive learning has achieved remarkable success in representation learning via self-supervision in unsupervised settings. However, effectively adapting contrastive learning to supervised learning tasks remains as a challenge in practice. In this work, we introduce a dual contrastive learning (DualCL) framework that simultaneously learns the features of input samples and the parameters of classifiers in the same space. Specifically, DualCL regards the parameters of the classifiers as augmented samples associating to different labels and then exploits the contrastive learning between the input samples and the augmented samples. Empirical studies on five benchmark text classification datasets and their low-resource version demonstrate the improvement in classification accuracy and confirm the capability of learning discriminative representations of DualCL.",
  "title": "Dual Contrastive Learning: Text Classification via Label-Aware Data Augmentation",
  "collection": "Text Classification Models",
  "area": "Natural Language Processing"
}
{
  "name": "ABCNet",
  "full_name": "Adaptive Bezier-Curve Network",
  "description": "**Adaptive Bezier-Curve Network**, or **ABCNet**, is an end-to-end framework for arbitrarily-shaped scene text spotting. It adaptively fits arbitrary-shaped text by a parameterized bezier curve. It also utilizes a feature alignment layer, [BezierAlign](https://paperswithcode.com/method/bezieralign), to calculate convolutional features of text instances in curved shapes. These features are then passed to a light-weight recognition head.",
  "title": "ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network",
  "collection": "Scene Text Models",
  "area": "Computer Vision"
}
{
  "name": "ZFNet",
  "full_name": "ZFNet",
  "description": "**ZFNet** is a classic convolutional neural network. The design was motivated by visualizing intermediate feature layers and the operation of the classifier. Compared to [AlexNet](https://paperswithcode.com/method/alexnet), the filter sizes are reduced and the stride of the convolutions are reduced.",
  "title": "Visualizing and Understanding Convolutional Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Global Average Pooling",
  "full_name": "Global Average Pooling",
  "description": "**Global Average Pooling** is a pooling operation designed to replace fully connected layers in classical CNNs. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the [softmax](https://paperswithcode.com/method/softmax) layer. \r\n\r\nOne advantage of global [average pooling](https://paperswithcode.com/method/average-pooling) over the fully connected layers is that it is more native to the [convolution](https://paperswithcode.com/method/convolution) structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling thus overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.",
  "title": "Network In Network",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "Supervised Contrastive Loss",
  "full_name": "Supervised Contrastive Loss",
  "description": "**Supervised Contrastive Loss** is an alternative loss function to cross entropy that the authors argue can leverage label information more effectively. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes.\r\n\r\n$$\r\n  \\mathcal{L}^{sup}=\\sum_{i=1}^{2N}\\mathcal{L}_i^{sup}\r\n  \\label{eqn:total_supervised_loss}\r\n$$\r\n\r\n$$\r\n  \\mathcal{L}\\_i^{sup}=\\frac{-1}{2N\\_{\\boldsymbol{\\tilde{y}}\\_i}-1}\\sum\\_{j=1}^{2N}\\mathbf{1}\\_{i\\neq j}\\cdot\\mathbf{1}\\_{\\boldsymbol{\\tilde{y}}\\_i=\\boldsymbol{\\tilde{y}}_j}\\cdot\\log{\\frac{\\exp{\\left(\\boldsymbol{z}\\_i\\cdot\\boldsymbol{z}\\_j/\\tau\\right)}}{\\sum\\_{k=1}^{2N}\\mathbf{1}\\_{i\\neq k}\\cdot\\exp{\\left(\\boldsymbol{z}\\_i\\cdot\\boldsymbol{z}\\_k/\\tau\\right)}}}\r\n$$\r\n\r\nwhere $N_{\\boldsymbol{\\tilde{y}}_i}$ is the total number of images in the minibatch that have the same label, $\\boldsymbol{\\tilde{y}}_i$, as the anchor, $i$. This loss has important properties well suited for supervised learning: (a) generalization to an arbitrary number of positives, (b) contrastive power increases with more negatives.",
  "title": "Supervised Contrastive Learning",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Mogrifier LSTM",
  "full_name": "Mogrifier LSTM",
  "description": "The **Mogrifier LSTM** is an extension to the [LSTM](https://paperswithcode.com/method/lstm) where the LSTM’s input $\\mathbf{x}$ is gated conditioned on the output of the previous step $\\mathbf{h}\\_{prev}$. Next, the gated input is used in a similar manner to gate the output of the\r\nprevious time step. After a couple of rounds of this mutual gating, the last updated $\\mathbf{x}$ and $\\mathbf{h}\\_{prev}$ are fed to an LSTM.  \r\n\r\nIn detail, the Mogrifier is an LSTM where two inputs $\\mathbf{x}$ and $\\mathbf{h}\\_{prev}$ modulate one another in an alternating fashion before the usual LSTM computation takes place. That is: $ \\text{Mogrify}\\left(\\mathbf{x}, \\mathbf{c}\\_{prev}, \\mathbf{h}\\_{prev}\\right) = \\text{LSTM}\\left(\\mathbf{x}^{↑}, \\mathbf{c}\\_{prev}, \\mathbf{h}^{↑}\\_{prev}\\right)$ where the modulated inputs $\\mathbf{x}^{↑}$ and $\\mathbf{h}^{↑}\\_{prev}$ are defined as the highest indexed $\\mathbf{x}^{i}$ and $\\mathbf{h}^{i}\\_{prev}$, respectively, from the interleaved sequences:\r\n\r\n$$ \\mathbf{x}^{i} = 2\\sigma\\left(\\mathbf{Q}^{i}\\mathbf{h}^{i−1}\\_{prev}\\right) \\odot x^{i-2} \\text{ for odd } i \\in \\left[1 \\dots r\\right] $$\r\n\r\n$$ \\mathbf{h}^{i}\\_{prev}  = 2\\sigma\\left(\\mathbf{R}^{i}\\mathbf{x}^{i-1}\\right) \\odot \\mathbf{h}^{i-2}\\_{prev} \\text{ for even } i \\in \\left[1 \\dots r\\right] $$\r\n\r\nwith $\\mathbf{x}^{-1} = \\mathbf{x}$ and $\\mathbf{h}^{0}\\_{prev} = \\mathbf{h}\\_{prev}$. The number of \"rounds\", $r \\in \\mathbb{N}$, is a hyperparameter; $r = 0$ recovers the LSTM. Multiplication with the constant 2 ensures that randomly initialized $\\mathbf{Q}^{i}$, $\\mathbf{R}^{i}$ matrices result in transformations close to identity. 
To reduce the number of additional model parameters, we typically factorize the $\\mathbf{Q}^{i}$, $\\mathbf{R}^{i}$ matrices as products of low-rank matrices: $\\mathbf{Q}^{i}$ =\r\n$\\mathbf{Q}^{i}\\_{left}\\mathbf{Q}^{i}\\_{right}$ with $\\mathbf{Q}^{i} \\in \\mathbb{R}^{m\\times{n}}$, $\\mathbf{Q}^{i}\\_{left} \\in \\mathbb{R}^{m\\times{k}}$, $\\mathbf{Q}^{i}\\_{right} \\in \\mathbb{R}^{k\\times{n}}$, where $k < \\min\\left(m, n\\right)$ is the rank.",
  "title": "Mogrifier LSTM",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "CuBERT",
  "full_name": "CuBERT",
  "description": "**CuBERT**, or **Code Understanding BERT**, is a [BERT](https://paperswithcode.com/method/bert) based model for code understanding. In order to achieve this, the authors curate a massive corpus of Python programs collected from GitHub. GitHub projects are known to contain a large amount of duplicate code. To avoid biasing the model to such duplicated code, authors perform deduplication using the method of [Allamanis (2018)](https://arxiv.org/abs/1812.06469). The resulting corpus has 7.4 million files with a total of 9.3 billion tokens (16 million unique).",
  "title": "Learning and Evaluating Contextual Embedding of Source Code",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Multi Loss ( BCE Loss + Focal Loss )  + Dice Loss",
  "full_name": "Multi Loss ( BCE Loss + Focal Loss )  + Dice Loss",
  "description": "Our proposed loss function is a combination of BCE Loss, Focal Loss, and Dice loss. Each one of them contributes individually to improve performance further details of loss functions are mentioned below,\r\n\r\n(1) BCE Loss calculates probabilities and compares each actual class output with predicted probabilities which can be either 0 or 1, it is based on Bernoulli distribution loss, it is mostly used when there are only two classes are available in our case there are exactly two classes are available one is background and other is foreground. In a proposed method it is used for pixel-level classification.\r\n\r\n(2) Focal Loss is a variant of BCE, it enables the model to focus on learning hard examples by decreasing the wights of easy examples it works well when the data is highly imbalanced.\r\n\r\n(3) Dice Loss is inspired by the Dice Coefficient Score which is an evaluation metric used to evaluate the results of image segmentation tasks. Dice Coefficient is convex in nature so it has been changed, so it can be more traceable. It is used to calculate the similarity between two images, Dice Loss represent as\r\n\r\n\r\nWe proposed a Loss function which is a combination of all three above mention loss functions to benefit from all, BCE is used for pixel-wise classification, Focal Loss is used for learning hard examples, we use 0.25 as the value for alpha and 2.0 as the value of gamma. Dice Loss is used for learning better boundary representation, our proposed loss function represent as\r\n\\begin{equation}\r\nLoss = \\left( BCE Loss + Focal Loss \\right)  + Dice Loss\r\n\\end{equation}",
  "title": "HistoSeg : Quick attention with multi-loss function for multi-structure segmentation in digital histology images",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "ETC",
  "full_name": "Extended Transformer Construction",
  "description": "**Extended Transformer Construction**, or **ETC**, is an extension of the [Transformer](https://paperswithcode.com/method/transformer) architecture with a new attention mechanism that extends the original in two main ways: (1) it allows scaling up the input length from 512 to several thousands; and (2) it can ingesting structured inputs instead of just linear sequences. The key ideas that enable ETC to achieve these are a new [global-local attention mechanism](https://paperswithcode.com/method/global-local-attention), coupled with [relative position encodings](https://paperswithcode.com/method/relative-position-encodings). ETC also allows lifting weights from existing [BERT](https://paperswithcode.com/method/bert) models, saving computational resources while training.",
  "title": "ETC: Encoding Long and Structured Inputs in Transformers",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Focal Loss",
  "full_name": "Focal Loss",
  "description": "A **Focal Loss** function addresses class imbalance during training in tasks like object detection. Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. It is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. \r\n\r\nFormally, the Focal Loss adds a factor $(1 - p\\_{t})^\\gamma$ to the standard cross entropy criterion. Setting $\\gamma>0$ reduces the relative loss for well-classified examples ($p\\_{t}>.5$), putting more focus on hard, misclassified examples. Here there is tunable *focusing* parameter $\\gamma \\ge 0$. \r\n\r\n$$ {\\text{FL}(p\\_{t}) = - (1 - p\\_{t})^\\gamma \\log\\left(p\\_{t}\\right)} $$",
  "title": "Focal Loss for Dense Object Detection",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Good Feature Matching",
  "full_name": "Good Feature Matching",
  "description": "**Good Feature Matching** is an active map-to-frame feature matching method. Feature matching effort is tied to submatrix selection, which has combinatorial time complexity and requires choosing a scoring metric. Via simulation, the Max-logDet matrix revealing metric is shown to perform best.",
  "title": "Good Feature Matching: Towards Accurate, Robust VO/VSLAM with Low Latency",
  "collection": "Feature Matching",
  "area": "General"
}
{
  "name": "Singular Value Clipping",
  "full_name": "Singular Value Clipping",
  "description": "**Singular Value Clipping (SVC)** is an adversarial training technique used by [TGAN](https://paperswithcode.com/method/tgan) to enforce the 1-Lipschitz constraint of the [WGAN](https://paperswithcode.com/method/wgan) objective. It is a constraint to all linear layers in the discriminator that satisfies the spectral norm of weight parameter $W$ is equal or less than one. This\r\nmeans that the singular values of weight matrix are all one or less. Therefore singular value decomposition (SVD) is performed after a parameter update, replacing all the singular values larger than one with one, and the parameters are reconstructed with them. The same operation is applied to convolutional layers by interpreting a higher order tensor in weight parameter as a matrix $\\hat{W}$.",
  "title": "Temporal Generative Adversarial Nets with Singular Value Clipping",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "ARiA",
  "full_name": "Adaptive Richard's Curve Weighted Activation",
  "description": "This work introduces a novel activation unit that can be efficiently employed in deep neural nets (DNNs) and performs significantly better than the traditional Rectified Linear Units ([ReLU](https://paperswithcode.com/method/relu)). The function developed is a two parameter version of the specialized Richard's Curve and we call it Adaptive Richard's Curve weighted Activation (ARiA). This function is non-monotonous, analogous to the newly introduced [Swish](https://paperswithcode.com/method/swish), however allows a precise control over its non-monotonous convexity by varying the hyper-parameters. We first demonstrate the mathematical significance of the two parameter ARiA followed by its application to benchmark problems such as MNIST, CIFAR-10 and CIFAR-100, where we compare the performance with ReLU and Swish units. Our results illustrate a significantly superior performance on all these datasets, making ARiA a potential replacement for ReLU and other activations in DNNs.",
  "title": "ARiA: Utilizing Richard's Curve for Controlling the Non-monotonicity of the Activation Function in Deep Neural Nets",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "DeCLUTR",
  "full_name": "DeCLUTR",
  "description": "**DeCLUTR** is an approach for learning universal sentence embeddings that utilizes a self-supervised objective that does not require labelled training data. The objective learns universal sentence embeddings by training an encoder to minimize the distance between the embeddings of textual segments randomly sampled from nearby in the same document.",
  "title": "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "WaveTTS",
  "full_name": "WaveTTS",
  "description": "**WaveTTS** is a [Tacotron](https://paperswithcode.com/method/tacotron)-based text-to-speech architecture that has two loss functions: 1) time-domain loss, denoted as the waveform loss, that measures the distortion between the natural and generated waveform; and 2) frequency-domain loss, that measures the Mel-scale acoustic feature loss between the natural and generated acoustic features.\r\n\r\nThe motivation arises from [Tacotron 2](https://paperswithcode.com/method/tacotron-2). Here its feature prediction network is trained independently of the [WaveNet](https://paperswithcode.com/method/wavenet) vocoder. At run-time, the feature prediction network and WaveNet vocoder are artificially joined together. As a result, the framework suffers from the mismatch between frequency-domain acoustic features and time-domain waveform. To overcome such mismatch, WaveTTS uses a joint time-frequency domain loss for TTS that effectively improves the synthesized voice quality.",
  "title": "WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "TSRUp",
  "full_name": "TSRUp",
  "description": "**TSRUp**, or **Transformation-based Spatial Recurrent Unit p**, is a modification of a [ConvGRU](https://paperswithcode.com/method/cgru) used in the [TriVD-GAN](https://paperswithcode.com/method/trivd-gan) architecture for video generation.\r\n\r\nIt largely follows [TSRUc](https://paperswithcode.com/method/tsruc), but computes $\\theta$, $u$ and $c$ in parallel given $x\\_{t}$ and $h\\_{t−1}$, yielding the following replacement for the $c$ update equation:\r\n\r\n$$ c = \\rho\\left(W\\_{c} \\star\\_{n}\\left[h\\_{t-1}; x\\_{t}\\right] + b\\_{c} \\right) $$\r\n\r\nIn these equations $\\sigma$ and $\\rho$ are the elementwise sigmoid and [ReLU](https://paperswithcode.com/method/relu) functions respectively and the $\\star\\_{n}$ represents a [convolution](https://paperswithcode.com/method/convolution) with a kernel of size $n \\times n$. Brackets are used to represent a feature concatenation.",
  "title": "Transformation-based Adversarial Video Prediction on Large-Scale Data",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Panoptic-PolarNet",
  "full_name": "Panoptic-PolarNet",
  "description": "**Panoptic-PolarNet** is a point cloud segmentation framework for LiDAR point clouds. It learns both semantic segmentation and class-agnostic instance clustering in a single inference network using a polar Bird's Eye View (BEV) representation, enabling the authors to circumvent the issue of occlusion among instances in urban street scenes. We first encode the raw point cloud data with $K$ features into a fixed-size representation on the polar BEV map. Next, we use a single backbone encoder-decoder network to generate semantic prediction, center [heatmap](https://paperswithcode.com/method/heatmap) and offset regression. Finally, we merge these outputs via a voting-based fusion to yield the panoptic segmentation result.",
  "title": "Panoptic-PolarNet: Proposal-free LiDAR Point Cloud Panoptic Segmentation",
  "collection": "Point Cloud Models",
  "area": "Computer Vision"
}
{
  "name": "GPT-2",
  "full_name": "GPT-2",
  "description": "**GPT-2** is a [Transformer](https://paperswithcode.com/methods/category/transformers) architecture that was notable for its size (1.5 billion parameters) on its release. The model is pretrained on a WebText dataset - text from 45 million website links. It largely follows the previous [GPT](https://paperswithcode.com/method/gpt) architecture with some modifications:\r\n\r\n- [Layer normalization](https://paperswithcode.com/method/layer-normalization) is moved to the input of each sub-block, similar to a\r\npre-activation residual network and an additional layer normalization was added after the final self-attention block. \r\n\r\n- A modified initialization which accounts for the accumulation on the residual path with model depth\r\nis used. Weights of residual layers are scaled at initialization by a factor of $1/\\sqrt{N}$ where $N$ is the number of residual layers. \r\n\r\n- The vocabulary is expanded to 50,257. The context size is expanded from 512 to 1024 tokens and\r\na larger batch size of 512 is used.",
  "title": "Language Models are Unsupervised Multitask Learners",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "VirTex",
  "full_name": "VirTex",
  "description": "**VirText**, or **Visual representations from Textual annotations** is a pretraining approach using semantically dense captions to learn visual representations. First a ConvNet and [Transformer](https://paperswithcode.com/method/transformer) are jointly trained from scratch to generate natural language captions for images. Then, the learned features are transferred to downstream visual recognition tasks.",
  "title": "VirTex: Learning Visual Representations from Textual Annotations",
  "collection": "Image Representations",
  "area": "Computer Vision"
}
{
  "name": "OODformer",
  "full_name": "OODformer",
  "description": "OODformer is a [transformer](https://paperswithcode.com/method/transformer)-based OOD detection architecture that leverages the contextualization capabilities of the transformer. Incorporating the transformer as the principal feature extractor allows to exploit the object concepts and their discriminate attributes along with their co-occurrence via [visual attention](https://paperswithcode.com/method/visual-attention). \r\n\r\nOODformer employs [ViT](method/vision-transformer) and its data efficient variant [DeiT](/method/deit). Each encoder layer consist of multi-head self attention and a multi-layer perception block. The combination of MSA and MLP layers in the encoder jointly encode the attributes' importance, associated correlation, and co-occurrence. The [class] token (a representative of an image $x$) consolidated multiple attributes and their related features via the global context. The [class] token from the final layer is used for OOD detection in two ways; first, it is passed to $\r\nF_{\\text {classifier }}\\left(x_{\\text {feat }}\\right)$  for softmax confidence score, and second it is used for latent space distance calculation.",
  "title": "OODformer: Out-Of-Distribution Detection Transformer",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Stacked Hourglass Network",
  "full_name": "Stacked Hourglass Network",
  "description": "**Stacked Hourglass Networks** are a type of convolutional neural network for pose estimation. They are based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.",
  "title": "Stacked Hourglass Networks for Human Pose Estimation",
  "collection": "Pose Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "Residual GRU",
  "full_name": "Residual GRU",
  "description": "A **Residual GRU** is a [gated recurrent unit (GRU)](https://paperswithcode.com/method/gru) that incorporates the idea of residual connections from [ResNets](https://paperswithcode.com/method/resnet).",
  "title": "Full Resolution Image Compression with Recurrent Neural Networks",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "CSL",
  "full_name": "Circular Smooth Label",
  "description": "**Circular Smooth Label** (CSL) is a classification-based rotation detection technique for arbitrary-oriented object detection. It is used for circularly distributed angle classification and addresses the periodicity of the angle and increases the error tolerance to adjacent angles.",
  "title": "On the Arbitrary-Oriented Object Detection: Classification based Approaches Revisited",
  "collection": "Arbitrary Object Detectors",
  "area": "Computer Vision"
}
{
  "name": "SCNet",
  "full_name": "SCNet",
  "description": "**Sample Consistency Network (SCNet)** is a method for instance segmentation which ensures the IoU distribution of the samples at training time are as close to that at inference time. To this end, only the outputs of the last box stage are used for mask predictions at both training and inference. The Figure shows the IoU distribution of the samples going to the mask branch at training time with/without sample consistency compared to that at inference time.",
  "title": "SCNet: Training Inference Sample Consistency for Instance Segmentation",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "PipeDream",
  "full_name": "PipeDream",
  "description": "PipeDream is an asynchronous pipeline parallel strategy for training large neural networks. It adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible.",
  "title": null,
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "Transductive Inference",
  "full_name": "Transductive Inference",
  "description": "",
  "title": "Transductive Inference and Semi-Supervised Learning",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "Branch attention",
  "full_name": "Branch attention",
  "description": "Branch attention can be seen as a dynamic branch selection mechanism: which to pay attention to, used with a multi-branch structure.",
  "title": "Training Very Deep Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Residual SRM",
  "full_name": "Residual SRM",
  "description": "A **Residual SRM** is a module for convolutional neural networks that uses a [Style-based Recalibration Module](https://paperswithcode.com/method/style-based-recalibration-module) within a [residual block](https://paperswithcode.com/method/residual-block) like structure. The Style-based Recalibration Module (SRM) adaptively recalibrates intermediate feature maps by exploiting their styles.",
  "title": "SRM : A Style-based Recalibration Module for Convolutional Neural Networks",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "Gated Convolution",
  "full_name": "Gated Convolution",
  "description": "A **Gated Convolution** is a type of temporal [convolution](https://paperswithcode.com/method/convolution) with a gating mechanism. Zero-padding is used to ensure that future context can not be seen.",
  "title": "Language Modeling with Gated Convolutional Networks",
  "collection": "Temporal Convolutions",
  "area": "Sequential"
}
{
  "name": "Viewmaker Network",
  "full_name": "Viewmaker Network",
  "description": "**Viewmaker Network** is a type of generative model that learns to produce input-dependent views for contrastive learning. This network is trained jointly with an encoder network. The viewmaker network is trained adversarially to create views which increase the contrastive loss of the encoder network. Rather than directly outputting views for an image, the viewmaker instead outputs a stochastic perturbation that is added to the input. This perturbation is projected onto an $\\mathcal{l}\\_{p}$ sphere, controlling the effective strength of the view, similar to methods in adversarial robustness. This constrained adversarial training method enables the model to reduce the mutual information between different views while preserving useful input features for the encoder to learn from.\r\n\r\nSpecifically, the encoder and viewmaker are optimized in alternating steps to minimize and maximize $\\mathcal{L}$, respectively. An image-to-image neural network is used as the viewmaker network, with an architecture adapted from work on style transfer. This network ingests the input image and outputs a perturbation that is constrained to an $\\ell_{1}$ sphere. The sphere's radius is determined by the volume of the input tensor times a hyperparameter $\\epsilon$, the distortion budget, which determines the strength of the applied perturbation. This perturbation is added to the input image and optionally clamped in the case of images to ensure all pixels are in $[0,1]$.",
  "title": "Viewmaker Networks: Learning Views for Unsupervised Representation Learning",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "SMA",
  "full_name": "Slime Mould Algorithm",
  "description": "**Slime Mould Algorithm** (**SMA**) is a new stochastic optimizer proposed based on the oscillation mode of slime mould in nature. SMA has several new features with a unique mathematical model that uses adaptive weights to simulate the process of producing positive and negative feedback of the propagation wave of slime mould based on bio-oscillator to form the optimal path for connecting food with excellent exploratory ability and exploitation propensity.\r\n\r\n🔗 The source codes of SMA are publicly available at [https://aliasgharheidari.com/SMA.html](https://aliasgharheidari.com/SMA.html)",
  "title": null,
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "CDEP",
  "full_name": "Contextual Decomposition Explanation Penalization",
  "description": "**Contextual Decomposition Explanation Penalization (CDEP)** is a method which leverages existing explanation techniques for neural networks in order to prevent a model from learning\r\nunwanted relationships and ultimately improve predictive accuracy. Given particular importance\r\nscores, CDEP works by allowing the user to directly penalize importances of certain features, or\r\ninteractions. This forces the neural network to not only produce the correct prediction, but also the\r\ncorrect explanation for that prediction",
  "title": "Interpretations are useful: penalizing explanations to align neural networks with prior knowledge",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "Early Dropout",
  "full_name": "Early Dropout",
  "description": "Introduced by Hinton et al. in 2012, dropout has stood the test of time as a regularizer for preventing overfitting in neural networks. In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training. During the early phase, we find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient. This helps counteract the stochasticity of SGD and limit the influence of individual batches on model training. Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards. Models equipped with early dropout achieve lower final training loss compared to their counterparts without dropout. Additionally, we explore a symmetric technique for regularizing overfitting models - late dropout, where dropout is not used in the early iterations and is only activated later in training. Experiments on ImageNet and various vision tasks demonstrate that our methods consistently improve generalization accuracy. Our results encourage more research on understanding regularization in deep learning and our methods can be useful tools for future neural network training, especially in the era of large data. Code is available at https://github.com/facebookresearch/dropout .",
  "title": "Dropout Reduces Underfitting",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Network Dissection",
  "full_name": "Network Dissection",
  "description": "**Network Dissection** is an interpretability method for [CNNs](https://paperswithcode.com/methods/category/convolutional-neural-networks) that evaluates the alignment between individual hidden units and a set of visual semantic concepts. By identifying the best alignments, units are given human interpretable labels across a range of objects, parts, scenes, textures, materials, and colors. \r\n\r\nThe measurement of interpretability proceeds in three steps:\r\n\r\n- Identify a broad set of human-labeled visual concepts.\r\n- Gather the response of the hidden variables to known concepts.\r\n- Quantify alignment of hidden variable−concept pairs.",
  "title": "Interpreting Deep Visual Representations via Network Dissection",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "FBNet",
  "full_name": "FBNet",
  "description": "**FBNet** is a type of convolutional neural architectures discovered through [DNAS](https://paperswithcode.com/method/dnas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). It utilises a basic type of image model block inspired by [MobileNetv2](https://paperswithcode.com/method/mobilenetv2) that utilises depthwise convolutions and an inverted residual structure (see components).",
  "title": "FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DouZero",
  "full_name": "DouZero",
  "description": "**DouZero** is an AI system for the card game DouDizhu that enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. The [Q-network](https://paperswithcode.com/method/dqn) of DouZero consists of an [LSTM](https://paperswithcode.com/method/lstm) to encode historical actions and six layers of [MLP](https://paperswithcode.com/method/feedforward-network) with hidden dimension of 512. The network predicts a value for a given state-action pair based on the concatenated representation of action and state.",
  "title": "DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning",
  "collection": "Card Game Models",
  "area": "Reinforcement Learning"
}
{
  "name": "GELU",
  "full_name": "Gaussian Error Linear Units",
  "description": "The **Gaussian Error Linear Unit**, or **GELU**,  is an activation function. The GELU activation function is $x\\Phi(x)$, where $\\Phi(x)$ the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their percentile, rather than gates inputs by their sign as in [ReLUs](https://paperswithcode.com/method/relu) ($x\\mathbf{1}_{x>0}$). Consequently the GELU can be thought of as a smoother ReLU.\r\n\r\n$$\\text{GELU}\\left(x\\right) = x{P}\\left(X\\leq{x}\\right) = x\\Phi\\left(x\\right) = x \\cdot \\frac{1}{2}\\left[1 + \\text{erf}(x/\\sqrt{2})\\right],$$\r\nif $X\\sim \\mathcal{N}(0,1)$.\r\n\r\nOne can approximate the GELU with\r\n$0.5x\\left(1+\\tanh\\left[\\sqrt{2/\\pi}\\left(x + 0.044715x^{3}\\right)\\right]\\right)$ or $x\\sigma\\left(1.702x\\right),$\r\nbut PyTorch's exact implementation is sufficiently fast such that these approximations may be unnecessary. (See also the [SiLU](https://paperswithcode.com/method/silu) $x\\sigma(x)$ which was also coined in the paper that introduced the GELU.)\r\n\r\nGELUs are used in [GPT-3](https://paperswithcode.com/method/gpt-3), [BERT](https://paperswithcode.com/method/bert), and most other Transformers.",
  "title": "Gaussian Error Linear Units (GELUs)",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Exponential Decay",
  "full_name": "Exponential Decay",
  "description": "**Exponential Decay** is a learning rate schedule where we decay the learning rate with more iterations using an exponential function:\r\n\r\n$$ \\text{lr} = \\text{lr}\\_{0}\\exp\\left(-kt\\right) $$\r\n\r\nImage Credit: [Suki Lau](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)",
  "title": null,
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "Funnel Transformer",
  "full_name": "Funnel Transformer",
  "description": "**Funnel Transformer** is a type of [Transformer](https://paperswithcode.com/methods/category/transformers) that gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. By re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, the model capacity is further improved. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-[transformer](https://paperswithcode.com/method/transformer) is able to recover a deep representation for each token from the reduced hidden sequence via a decoder.\r\n\r\nThe proposed model keeps the same overall skeleton of interleaved S-[Attn](https://paperswithcode.com/method/scaled) and P-[FFN](https://paperswithcode.com/method/dense-connections) sub-modules wrapped by [residual connection](https://paperswithcode.com/method/residual-connection) and [layer normalization](https://paperswithcode.com/method/layer-normalization). But differently, to achieve representation compression and computation reduction, THE model employs an encoder that gradually reduces the sequence length of the hidden states as the layer gets deeper. In addition, for tasks involving per-token predictions like pretraining, a simple decoder is used to reconstruct a full sequence of token-level representations from the compressed encoder output. Compression is achieved via a pooling operation,",
  "title": "Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "DIME",
  "full_name": "Distance to Modelled Embedding",
  "description": "**DIME**, or **Distance to Modelled Embedding**, is a method for detecting out-of-distribution examples during prediction time. Given a trained neural network, the training data drawn from some high-dimensional distribution in data space $X$ is transformed into the model’s intermediate feature vector space $\\mathbb{R}^{p}$. The training set embedding is linearly approximated as a hyperplane. When we then receive new observations it is difficult to assess if observations are out-of-distribution directly in data space, so we transform them into the same intermediate feature space. Finally, the Distance-to-Modelled-Embedding (DIME) can be used to assess whether new observations fit into the expected embedding covariance structure.",
  "title": "Out-of-Distribution Example Detection in Deep Neural Networks using Distance to Modelled Embedding",
  "collection": "Out-of-Distribution Example Detection",
  "area": "General"
}
{
  "name": "QHM",
  "full_name": "QHM",
  "description": "**Quasi-Hyperbolic Momentum (QHM)** is a stochastic optimization technique that alters [momentum SGD](https://paperswithcode.com/method/sgd-with-momentum) with a momentum step, averaging an [SGD](https://paperswithcode.com/method/sgd) step with a momentum step:\r\n\r\n$$ g\\_{t+1} = \\beta{g\\_{t}} + \\left(1-\\beta\\right)\\cdot{\\nabla}\\hat{L}\\_{t}\\left(\\theta\\_{t}\\right) $$\r\n$$ \\theta\\_{t+1} = \\theta\\_{t} - \\alpha\\left[\\left(1-v\\right)\\cdot\\nabla\\hat{L}\\_{t}\\left(\\theta\\_{t}\\right) + v\\cdot{g\\_{t+1}}\\right]$$\r\n\r\nThe authors suggest a rule of thumb of $v = 0.7$ and $\\beta = 0.999$.",
  "title": "Quasi-hyperbolic momentum and Adam for deep learning",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Swin Transformer",
  "full_name": "Swin Transformer",
  "description": "The **Swin Transformer** is a type of [Vision Transformer](https://paperswithcode.com/method/vision-transformer). It builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. In contrast, previous vision Transformers produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally.",
  "title": "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "CenterNet",
  "full_name": "CenterNet",
  "description": "**CenterNet** is a one-stage object detector that detects each object as a triplet, rather than a pair, of keypoints. It utilizes two customized modules named [cascade corner pooling](https://paperswithcode.com/method/cascade-corner-pooling) and [center pooling](https://paperswithcode.com/method/center-pooling), which play the roles of enriching information collected by both top-left and bottom-right corners and providing more recognizable information at the central regions, respectively. The intuition is that, if a predicted bounding box has a high IoU with the ground-truth box, then the probability that the center keypoint in its central region is predicted as the same class is high, and vice versa. Thus, during inference, after a proposal is generated as a pair of corner keypoints, we determine if the proposal is indeed an object by checking if there is a center keypoint of the same class falling within its central region.",
  "title": "CenterNet: Keypoint Triplets for Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "CubeRE",
  "full_name": "CubeRE",
  "description": "Our model known as CubeRE first encodes each input sentence using a language model encoder to obtain the contextualized sequence representation. We then capture the interaction between each possible head and tail entity as a pair representation for predicting the entity-relation label scores. To reduce the computational cost, each sentence is pruned to retain only words that have higher entity scores. Finally, we capture the interaction between each possible relation triplet and qualifier to predict the qualifier label scores and decode the outputs.",
  "title": "A Dataset for Hyper-Relational Extraction and a Cube-Filling Approach",
  "collection": "Relation Extraction Models",
  "area": "Natural Language Processing"
}
{
  "name": "ReLIC",
  "full_name": "ReLIC",
  "description": "**ReLIC**, or **Representation Learning via Invariant Causal Mechanisms**, is a self-supervised learning objective that enforces invariant prediction of proxy targets across augmentations through an invariance regularizer which yields improved generalization guarantees. \r\n\r\nWe can write the objective as:\r\n\r\n$$\r\n\\underset{X}{\\mathbb{E}} \\underset{\\sim\\_{l k}, a\\_{q \\mathcal{A}}}{\\mathbb{E}} \\sum_{b \\in\\left\\(a\\_{l k}, a\\_{q t}\\right\\)} \\mathcal{L}\\_{b}\\left(Y^{R}, f(X)\\right) \\text { s.t. } K L\\left(p^{d o\\left(a\\_{l k}\\right)}\\left(Y^{R} \\mid f(X)\\right), p^{d o\\left(a\\_{q t}\\right)}\\left(Y^{R} \\mid f(X)\\right)\\right) \\leq \\rho\r\n$$\r\n\r\nwhere $\\mathcal{L}$ is the proxy task loss and $K L$ is the Kullback-Leibler (KL) divergence. Note that any distance measure on distributions can be used in place of the KL divergence.\r\n\r\nConcretely, as proxy task we associate to every datapoint $x\\_{i}$ the label $y\\_{i}^{R}=i$. This corresponds to the instance discrimination task, commonly used in contrastive learning. We take pairs of points $\\left(x\\_{i}, x\\_{j}\\right)$ to compute similarity scores and use pairs of augmentations $a\\_{l k}=\\left(a\\_{l}, a\\_{k}\\right) \\in$ $\\mathcal{A} \\times \\mathcal{A}$ to perform a style intervention. Given a batch of samples $\\left\\(x\\_{i}\\right\\)\\_{i=1}^{N} \\sim \\mathcal{D}$, we use\r\n\r\n$$\r\np^{d o\\left(a\\_{l k}\\right)}\\left(Y^{R}=j \\mid f\\left(x\\_{i}\\right)\\right) \\propto \\exp \\left(\\phi\\left(f\\left(x\\_{i}^{a\\_{l}}\\right), h\\left(x\\_{j}^{a\\_{k}}\\right)\\right) / \\tau\\right)\r\n$$\r\n\r\nwith $x^{a}$ data augmented with $a$ and $\\tau$ a softmax temperature parameter. We encode $f$ using a neural network and choose $h$ to be related to $f$, e.g. $h=f$ or as a network with an exponential moving average of the weights of $f$ (e.g. target networks). 
To compare representations we use the function $\\phi\\left(f\\left(x\\_{i}\\right), h\\left(x\\_{j}\\right)\\right)=\\left\\langle g\\left(f\\left(x\\_{i}\\right)\\right), g\\left(h\\left(x\\_{j}\\right)\\right)\\right\\rangle$ where $g$ is a fully-connected neural network often called the critic.\r\n\r\nCombining these pieces, we learn representations by minimizing the following objective over the full set of data $x\\_{i} \\in \\mathcal{D}$ and augmentations $a\\_{l k} \\in \\mathcal{A} \\times \\mathcal{A}$:\r\n\r\n$$\r\n-\\sum\\_{i=1}^{N} \\sum\\_{a\\_{l k}} \\log \\frac{\\exp \\left(\\phi\\left(f\\left(x\\_{i}^{a\\_{l}}\\right), h\\left(x\\_{i}^{a\\_{k}}\\right)\\right) / \\tau\\right)}{\\sum\\_{m=1}^{M} \\exp \\left(\\phi\\left(f\\left(x\\_{i}^{a\\_{l}}\\right), h\\left(x\\_{m}^{a\\_{k}}\\right)\\right) / \\tau\\right)}+\\alpha \\sum\\_{a\\_{l k}, a\\_{q t}} K L\\left(p^{d o\\left(a\\_{l k}\\right)}, p^{d o\\left(a\\_{q t}\\right)}\\right)\r\n$$\r\n\r\nwith $M$ the number of points we use to construct the contrast set and $\\alpha$ the weighting of the invariance penalty. The shorthand $p^{d o(a)}$ is used for $p^{d o(a)}\\left(Y^{R}=j \\mid f\\left(x\\_{i}\\right)\\right)$. The Figure shows a schematic of the ReLIC objective.",
  "title": "Representation Learning via Invariant Causal Mechanisms",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Strided EESP",
  "full_name": "Strided EESP",
  "description": "A **Strided EESP** unit is based on the [EESP Unit](https://paperswithcode.com/method/eesp) but is modified to learn representations more efficiently at multiple scales. Depth-wise dilated convolutions are given strides, an [average pooling](https://paperswithcode.com/method/average-pooling) operation is added instead of an identity connection, and the element-wise addition operation is replaced with a concatenation operation, which helps in expanding the dimensions of feature maps efficiently.",
  "title": "ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "IPBI",
  "full_name": "Instances-Pixels Balance Index",
  "description": "In a given dataset for semantic image segmentation, the number of samples per class should be the same, so that no classifier would be biased towards the majority class (here included the background). It is very difficult, if not impossible, to achieve a perfect balance between the several classes of objects of a dataset. Considering that the segmentation of the objects  is accomplished at the pixel level, the number of pixels for each class must be taken into account. As a matter of fact, in image semantic segmentation, \r\ndifferent classes and the background may have quite different\r\nsizes. Therefore, the image segmentation problem is naturally unbalanced. The IPBI is based on the concept of entropy, a common measure used in many fields of science. In a general sense, it measures the amount of disorder of a system. For the sake of semantic image segmentation, the ideal dataset should have the same number of instances per class, as well as the same number of pixels in all classes. Similar reasoning can be done considering the number of pixels of all samples in a class, so that we can obtain the\r\npixels balance measure for the dataset. Overall, IPBI evaluates the balance of pixels and number of instances of an image semantic segmentation dataset and, so, it is usefull to compare different datasets.",
  "title": "EPYNET: Efficient Pyramidal Network for Clothing Segmentation",
  "collection": "Image Semantic Segmentation Metric",
  "area": "Computer Vision"
}
{
  "name": "FCPose",
  "full_name": "FCPose",
  "description": "**FCPose** is a fully convolutional multi-person [pose estimation framework](https://paperswithcode.com/methods/category/pose-estimation-models) using dynamic instance-aware convolutions. Different from existing methods, which often require ROI (Region of Interest) operations and/or grouping post-processing, FCPose eliminates the ROIs and grouping pre-processing with dynamic instance aware keypoint estimation heads. The dynamic keypoint heads are conditioned on each instance (person), and can encode the instance concept in the dynamically-generated weights of their filters. \r\n\r\nOverall, FCPose is built upon the one-stage object detector [FCOS](https://paperswithcode.com/method/fcos). The controller that generates the weights of the keypoint heads is attached to the FCOS heads. The weights $\\theta\\_{i}$ generated by the controller is used to fulfill the keypoint head $f$ for the instance $i$. Moreover, a keypoint refinement module is introduced to predict the offsets from each location of the heatmaps to the ground-truth keypoints. Finally, the coordinates derived from the predicted heatmaps are refined by the offsets predicted by the keypoint refinement module, resulting in the final keypoint results. \"Rel. coord.\" is a map of the relative coordinates from all the locations of the feature maps $F$ to the location where the weights are generated. The relative coordinate map is concatenated to $F$ as the input to the keypoint head.",
  "title": "FCPose: Fully Convolutional Multi-Person Pose Estimation with Dynamic Instance-Aware Convolutions",
  "collection": "Pose Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "MixNet",
  "full_name": "MixNet",
  "description": "**MixNet** is a type of convolutional neural network discovered via AutoML that utilises MixConvs instead of regular depthwise convolutions.",
  "title": "MixConv: Mixed Depthwise Convolutional Kernels",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Conditional Batch Normalization",
  "full_name": "Conditional Batch Normalization",
  "description": "**Conditional Batch Normalization (CBN)** is a class-conditional variant of [batch normalization](https://paperswithcode.com/method/batch-normalization). The key idea is to predict the $\\gamma$ and $\\beta$ of the batch normalization from an embedding - e.g. a language embedding in VQA. CBN enables the linguistic embedding to manipulate entire feature maps by scaling them up or down, negating them, or shutting them off. CBN has also been used in [GANs](https://paperswithcode.com/methods/category/generative-adversarial-networks) to allow class information to affect the batch normalization parameters.\r\n\r\nConsider a single convolutional layer with batch normalization module $\\text{BN}\\left(F\\_{i,c,h,w}|\\gamma\\_{c}, \\beta\\_{c}\\right)$ for which pretrained scalars $\\gamma\\_{c}$ and $\\beta\\_{c}$ are available. We would like to directly predict these affine scaling parameters from, e.g., a language embedding $\\mathbf{e\\_{q}}$. When starting the training procedure, these parameters must be close to the pretrained values to recover the original [ResNet](https://paperswithcode.com/method/resnet) model as a poor initialization could significantly deteriorate performance. Unfortunately, it is difficult to initialize a network to output the pretrained $\\gamma$ and $\\beta$. For these reasons, the authors propose to predict a change $\\delta\\beta\\_{c}$ and $\\delta\\gamma\\_{c}$ on the frozen original scalars, for which it is straightforward to initialize a neural network to produce an output with zero-mean and small variance.\r\n\r\nThe authors use a one-hidden-layer MLP to predict these deltas from a question embedding $\\mathbf{e\\_{q}}$ for all feature maps within the layer:\r\n\r\n$$\\Delta\\beta = \\text{MLP}\\left(\\mathbf{e\\_{q}}\\right)$$\r\n\r\n$$\\Delta\\gamma = \\text{MLP}\\left(\\mathbf{e\\_{q}}\\right)$$\r\n\r\nSo, given a feature map with $C$ channels, these MLPs output a vector of size $C$. 
We then add these predictions to the $\\beta$ and $\\gamma$ parameters:\r\n\r\n$$ \\hat{\\beta}\\_{c} = \\beta\\_{c} + \\Delta\\beta\\_{c} $$\r\n\r\n$$ \\hat{\\gamma}\\_{c} = \\gamma\\_{c} + \\Delta\\gamma\\_{c} $$\r\n\r\nFinally, these updated $\\hat{\\beta}$ and $\\hat{\\gamma}$ are used as parameters for the batch normalization: $\\text{BN}\\left(F\\_{i,c,h,w}|\\hat{\\gamma}\\_{c}, \\hat{\\beta}\\_{c}\\right)$. The authors freeze all ResNet parameters, including $\\gamma$ and $\\beta$, during training. A ResNet consists of four stages of computation, each subdivided into several residual blocks. In each block, the authors apply CBN to the three convolutional layers.",
  "title": "Modulating early visual processing by language",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "USE",
  "full_name": "Multilingual Universal Sentence Encoder",
  "description": "",
  "title": "Multilingual Universal Sentence Encoder for Semantic Retrieval",
  "collection": "Contextualized Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "GradientDICE",
  "full_name": "GradientDICE",
  "description": "**GradientDICE** is a density ratio learning method for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. It optimizes a different objective from [GenDICE](https://arxiv.org/abs/2002.09072) by using the Perron-Frobenius theorem and eliminating GenDICE’s use of divergence, such that nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation.",
  "title": "GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values",
  "collection": "Density Ratio Learning",
  "area": "Reinforcement Learning"
}
{
  "name": "TPN",
  "full_name": "Temporal Pyramid Network",
  "description": "**Temporal Pyramid Network**, or **TPN**, is a pyramid level module for action recognition at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. The source of features and the fusion of features form a feature hierarchy for the backbone so that it can capture action instances at various tempos. In the TPN, a Backbone Network is used to extract multiple level features, a Spatial Semantic Modulation spatially downsamples features to align semantics, a Temporal Rate Modulation temporally downsamples features to adjust relative tempo among levels, Information Flow aggregates features in various directions to enhance and enrich level-wise representations and Final Prediction rescales and concatenates all levels of pyramid along channel dimension.",
  "title": "Temporal Pyramid Network for Action Recognition",
  "collection": "Action Recognition Blocks",
  "area": "General"
}
{
  "name": "RAdam",
  "full_name": "RAdam",
  "description": "**Rectified Adam**, or **RAdam**, is a variant of the [Adam](https://paperswithcode.com/method/adam) stochastic optimizer that introduces a term to rectify the variance of the adaptive learning rate. It seeks to tackle the bad convergence problem suffered by Adam. The authors argue that the root cause of this behaviour is that the adaptive learning rate has undesirably large variance in the early stage of model training, due to the limited amount of training samples being used. Thus, to reduce such variance, it is better to use smaller learning rates in the first few epochs of training - which justifies the warmup heuristic. This heuristic motivates RAdam which rectifies the variance problem:\r\n\r\n$$g\\_{t} = \\nabla\\_{\\theta}f\\_{t}\\left(\\theta\\_{t-1}\\right) $$\r\n\r\n$$v\\_{t} = 1/\\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g^{2}\\_{t} $$\r\n\r\n$$m\\_{t} = \\beta\\_{1}m\\_{t-1} + \\left(1-\\beta\\_{1}\\right)g\\_{t} $$\r\n\r\n$$ \\hat{m\\_{t}} = m\\_{t} / \\left(1-\\beta^{t}\\_{1}\\right) $$\r\n\r\n$$ \\rho\\_{t} = \\rho\\_{\\infty} - 2t\\beta^{t}\\_{2}/\\left(1-\\beta^{t}\\_{2}\\right) $$\r\n\r\n$$\\rho_{\\infty} = \\frac{2}{1-\\beta_2} - 1$$ \r\n\r\nIf the variance is tractable - $\\rho\\_{t} > 4$ then:\r\n\r\n...the adaptive learning rate is computed as:\r\n\r\n$$ l\\_{t} = \\sqrt{\\left(1-\\beta^{t}\\_{2}\\right)/v\\_{t}}$$\r\n\r\n...the variance rectification term is calculated as:\r\n\r\n$$ r\\_{t} = \\sqrt{\\frac{(\\rho\\_{t}-4)(\\rho\\_{t}-2)\\rho\\_{\\infty}}{(\\rho\\_{\\infty}-4)(\\rho\\_{\\infty}-2)\\rho\\_{t}}}$$\r\n\r\n...and we update parameters with adaptive momentum:\r\n\r\n$$ \\theta\\_{t} = \\theta\\_{t-1} - \\alpha\\_{t}r\\_{t}\\hat{m}\\_{t}l\\_{t} $$\r\n\r\nIf the variance isn't tractable we update instead with:\r\n\r\n$$ \\theta\\_{t} = \\theta\\_{t-1} - \\alpha\\_{t}\\hat{m}\\_{t} $$",
  "title": "On the Variance of the Adaptive Learning Rate and Beyond",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "InterBERT",
  "full_name": "InterBERT",
  "description": "InterBERT aims to model interaction between information flows pertaining to different modalities. This new architecture builds multi-modal interaction and preserves the independence of single modal representation. InterBERT is built with an image embedding layer, a text embedding layer, a single-stream interaction module, and a two stream extraction module. The model is pre-trained with three tasks: 1) masked segment modeling, 2) masked region modeling, and 3) image-text matching.",
  "title": "InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "ConvTasNet",
  "full_name": "Convolutional time-domain audio separation network",
  "description": "Combines learned time-frequency representation with a masker architecture based on 1D [dilated convolution](https://paperswithcode.com/method/dilated-convolution).",
  "title": "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation",
  "collection": "Temporal Convolutions",
  "area": "Sequential"
}
{
  "name": "Composite Fields",
  "full_name": "Composite Fields",
  "description": "Represent and associate with a composite of primitive fields.",
  "title": "PifPaf: Composite Fields for Human Pose Estimation",
  "collection": "Image Representations",
  "area": "Computer Vision"
}
{
  "name": "Spatial-Reduction Attention",
  "full_name": "Spatial-Reduction Attention",
  "description": "**Spatial-Reduction Attention**, or **SRA**, is a [multi-head attention](https://paperswithcode.com/method/multi-head-attention) module used in the [Pyramid Vision Transformer](https://paperswithcode.com/method/pvt) architecture which reduces the spatial scale of the key $K$ and value $V$ before the attention operation. This reduces the computational/memory overhead. Details of the SRA in the stage $i$ can be formulated as follows:\r\n\r\n$$\r\n\\text{SRA}(Q, K, V)=\\text { Concat }\\left(\\operatorname{head}\\_{0}, \\ldots \\text { head }\\_{N\\_{i}}\\right) W^{O} $$\r\n\r\n$$\\text{ head}\\_{j}=\\text { Attention }\\left(Q W\\_{j}^{Q}, \\operatorname{SR}(K) W\\_{j}^{K}, \\operatorname{SR}(V) W\\_{j}^{V}\\right)\r\n$$\r\n\r\nwhere Concat $(\\cdot)$ is the concatenation operation. $W\\_{j}^{Q} \\in \\mathbb{R}^{C\\_{i} \\times d\\_{\\text {head }}}$, $W\\_{j}^{K} \\in \\mathbb{R}^{C\\_{i} \\times d\\_{\\text {head }}}$, $W\\_{j}^{V} \\in \\mathbb{R}^{C\\_{i} \\times d\\_{\\text {head }}}$, and $W^{O} \\in \\mathbb{R}^{C\\_{i} \\times C\\_{i}}$ are linear projection parameters. $N\\_{i}$ is the head number of the attention layer in Stage $i$. Therefore, the dimension of each head (i.e. $\\left.d\\_{\\text {head }}\\right)$ is equal to $\\frac{C\\_{i}}{N\\_{i}} . \\text{SR}(\\cdot)$ is the operation for reducing the spatial dimension of the input sequence ($K$ or $V$ ), which is written as:\r\n\r\n$$\r\n\\text{SR}(\\mathbf{x})=\\text{Norm}\\left(\\operatorname{Reshape}\\left(\\mathbf{x}, R\\_{i}\\right) W^{S}\\right)\r\n$$\r\n\r\nHere, $\\mathbf{x} \\in \\mathbb{R}^{\\left(H\\_{i} W\\_{i}\\right) \\times C\\_{i}}$ represents a input sequence, and $R\\_{i}$ denotes the reduction ratio of the attention layers in Stage $i .$ Reshape $\\left(\\mathbf{x}, R\\_{i}\\right)$ is an operation of reshaping the input sequence $\\mathbf{x}$ to a sequence of size $\\frac{H\\_{i} W\\_{i}}{R\\_{i}^{2}} \\times\\left(R\\_{i}^{2} C\\_{i}\\right)$. 
$W^{S} \\in \\mathbb{R}^{\\left(R\\_{i}^{2} C\\_{i}\\right) \\times C\\_{i}}$ is a linear projection that reduces the dimension of the input sequence to $C\\_{i}$. $\\text{Norm}(\\cdot)$ refers to layer normalization.",
  "title": "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "Xavier Initialization",
  "full_name": "Xavier Initialization",
  "description": "**Xavier Initialization**, or **Glorot Initialization**, is an initialization scheme for neural networks. Biases are initialized be 0 and the weights $W\\_{ij}$ at each layer are initialized as:\r\n\r\n$$ W\\_{ij} \\sim U\\left[-\\frac{\\sqrt{6}}{\\sqrt{fan_{in} + fan_{out}}}, \\frac{\\sqrt{6}}{\\sqrt{fan_{in} + fan_{out}}}\\right] $$\r\n\r\nWhere $U$ is a uniform distribution and $fan_{in}$ is the size of the previous layer (number of columns in $W$) and $fan_{out}$ is the size of the current layer.",
  "title": null,
  "collection": "Initialization",
  "area": "General"
}
{
  "name": "RPN",
  "full_name": "Region Proposal Network",
  "description": "A **Region Proposal Network**, or **RPN**, is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals. RPN and algorithms like [Fast R-CNN](https://paperswithcode.com/method/fast-r-cnn) can be merged into a single network by sharing their convolutional features - using the recently popular terminology of neural networks with attention mechanisms, the RPN component tells the unified network where to look.\r\n\r\nRPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. RPNs use anchor boxes that serve as references at multiple scales and aspect ratios. The scheme can be thought of as a pyramid of regression references, which avoids enumerating images or filters of multiple scales or aspect ratios.",
  "title": "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks",
  "collection": "Region Proposal",
  "area": "Computer Vision"
}
{
  "name": "Fast Voxel Query",
  "full_name": "Fast Voxel Query",
  "description": "**Fast Voxel Query** is a module used in the [Voxel Transformer](https://paperswithcode.com/method/votr) 3D object detection model implementation of self-attention, specifically Local and Dilated Attention. For each querying index $v\\_{i}$, an attending voxel index $v\\_{j}$ is determined by Local and Dilated Attention. Then we can lookup the non-empty index $j$ in the hash table with hashed $v\\_{j}$ as the key. Finally, the non-empty index $j$ is used to gather the attending feature $f\\_{j}$ from $\\mathcal{F}$ for [multi-head attention](https://paperswithcode.com/method/multi-head-attention).",
  "title": "Voxel Transformer for 3D Object Detection",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Fast R-CNN",
  "full_name": "Fast R-CNN",
  "description": "**Fast R-CNN** is an object detection model that improves in its predecessor [R-CNN](https://paperswithcode.com/method/r-cnn) in a number of ways. Instead of extracting CNN features independently for each region of interest, Fast R-CNN aggregates them into a single forward pass over the image; i.e. regions of interest from the same image share computation and memory in the forward and backward passes.",
  "title": "Fast R-CNN",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "SpreadsheetCoder",
  "full_name": "SpreadsheetCoder",
  "description": "**SpreadsheetCoder** is a neural network architecture for spreadsheet formula prediction. It is a [BERT](https://paperswithcode.com/method/bert)-based model architecture to represent the tabular context in both row-based and column-based formats. A [BERT](https://paperswithcode.com/method/bert) encoder computes an embedding vector for each input token, incorporating the contextual information from nearby rows and columns. The BERT encoder is initialized from the weights pre-trained on English text corpora, which is beneficial for encoding table headers. To handle cell references, a two-stage decoding process is used inspired by sketch learning for program synthesis. The decoder first generates a formula sketch, which does not include concrete cell references, and then predicts the corresponding cell ranges to generate the complete formula",
  "title": "SpreadsheetCoder: Formula Prediction from Semi-structured Context",
  "collection": "Spreadsheet Formula Prediction Models",
  "area": "General"
}
{
  "name": "OA-Mix",
  "full_name": "Object-Aware Mix",
  "description": "**OA-Mix** is a general and effective data augmentation method for single-domain generalization in object detection. It increases image diversity while preserving important semantic features with\r\nmulti-level transformations and object-aware mixing.",
  "title": "Object-Aware Domain Generalization for Object Detection",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Inverted Residual Block",
  "full_name": "Inverted Residual Block",
  "description": "An **Inverted Residual Block**, sometimes called an **MBConv Block**, is a type of residual block used for image models that uses an inverted structure for efficiency reasons. It was originally proposed for the [MobileNetV2](https://paperswithcode.com/method/mobilenetv2) CNN architecture. It has since been reused for several mobile-optimized CNNs.\r\n\r\nA traditional [Residual Block](https://paperswithcode.com/method/residual-block) has a wide -> narrow -> wide structure with the number of channels. The input has a high number of channels, which are compressed with a [1x1 convolution](https://paperswithcode.com/method/1x1-convolution). The number of channels is then increased again with a 1x1 [convolution](https://paperswithcode.com/method/convolution) so input and output can be added. \r\n\r\nIn contrast, an Inverted Residual Block follows a narrow -> wide -> narrow approach, hence the inversion. We first widen with a 1x1 convolution, then use a 3x3 [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution) (which greatly reduces the number of parameters), then we use a 1x1 convolution to reduce the number of channels so input and output can be added.",
  "title": "MobileNetV2: Inverted Residuals and Linear Bottlenecks",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "PLIP",
  "full_name": "Pathology Language and Image Pre-Training",
  "description": "Pathology Language and Image Pre-Training (PLIP) is a vision-and-language foundation model created by fine-tuning CLIP on pathology images.",
  "title": "Leveraging medical Twitter to build a visual–language foundation model for pathology AI",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Semantic Reasoning Network",
  "full_name": "Semantic Reasoning Network",
  "description": "**Semantic reasoning network**, or **SRN**, is an end-to-end trainable framework for scene text recognition that consists of four parts: backbone network, parallel [visual attention](https://paperswithcode.com/method/visual-attention) module (PVAM), global semantic reasoning module (GSRM), and visual-semantic fusion decoder (VSFD). Given an input image, the backbone network is first used to extract 2D features $V$. Then, the PVAM is used to generate $N$ aligned 1-D features $G$, where each feature corresponds to a character in the text and captures the aligned visual information. These $N$ 1-D features $G$ are then fed into a GSRM to capture the semantic information $S$. Finally, the aligned visual features $G$ and the semantic information $S$ are fused by the VSFD to predict $N$ characters. For text string shorter than $N$, ’EOS’ are padded.",
  "title": "Towards Accurate Scene Text Recognition with Semantic Reasoning Networks",
  "collection": "Scene Text Models",
  "area": "Computer Vision"
}
{
  "name": "SMITH",
  "full_name": "Siamese Multi-depth Transformer-based Hierarchical Encoder",
  "description": "**SMITH**, or **Siamese Multi-depth Transformer-based Hierarchical Encoder**, is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based model for document representation learning and matching. It contains several design choices to adapt [self-attention models](https://paperswithcode.com/methods/category/attention-modules) for long text inputs. For the model pre-training, a masked sentence block language modeling task is used in addition to the original masked word language model task used in [BERT](https://paperswithcode.com/method/bert), to capture sentence block relations within a document. Given a sequence of sentence block representation, the document level Transformers learn the contextual representation for each sentence block and the final document representation.",
  "title": "Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "AccoMontage",
  "full_name": "AccoMontage",
  "description": "**AccoMontage** is a model for accompaniment arrangement, a type of music generation task involving intertwined constraints of melody, harmony, texture, and music structure. AccoMontage generates piano accompaniments for folk/pop songs based on a lead sheet (i.e. a melody with chord progression). It first retrieves phrase montages from a database while recombining them structurally using dynamic programming. Second, chords of the retrieved phrases are manipulated to match the lead sheet via style transfer. Lastly, the system offers controls over the generation process. In contrast to pure deep learning approaches, AccoMontage uses a hybrid pathway, in which rule-based optimization and deep learning are both leveraged.",
  "title": "AccoMontage: Accompaniment Arrangement via Phrase Selection and Style Transfer",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "SPADE",
  "full_name": "Spatially-Adaptive Normalization",
  "description": "**SPADE**, or **Spatially-Adaptive Normalization** is a conditional normalization method for semantic image synthesis. Similar to [Batch Normalization](https://www.paperswithcode.com/method/batch-normalization), the activation is normalized in the channel-wise manner and then modulated with learned scale and bias. In the SPADE, the mask is first projected onto an embedding space and then convolved to produce the modulation parameters $\\gamma$ and $\\beta .$ Unlike prior conditional normalization methods, $\\gamma$ and $\\mathbf{\\beta}$ are not vectors, but tensors with spatial dimensions. The produced $\\gamma$ and $\\mathbf{\\beta}$ are multiplied and added to the normalized activation element-wise.",
  "title": "Semantic Image Synthesis with Spatially-Adaptive Normalization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "DIoU-NMS",
  "full_name": "DIoU-NMS",
  "description": "**DIoU-NMS** is a type of non-maximum suppression where we use Distance IoU rather than regular DIoU, in which the overlap area and the distance between two central points of bounding boxes are simultaneously considered when suppressing redundant boxes.\r\n\r\nIn original NMS, the IoU metric is used to suppress the redundant detection boxes, where the overlap area is the unique factor, often yielding false suppression for the cases with occlusion. With DIoU-NMS, we not only consider the overlap area but also central point distance between two boxes.",
  "title": "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression",
  "collection": "Proposal Filtering",
  "area": "Computer Vision"
}
{
  "name": "CP conv",
  "full_name": "Center-pivot convolution",
  "description": "",
  "title": "Hypercorrelation Squeeze for Few-Shot Segmentation",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Compressive Transformer",
  "full_name": "Compressive Transformer",
  "description": "The **Compressive Transformer** is an extension to the [Transformer](https://paperswithcode.com/method/transformer) which maps past hidden activations (memories) to a smaller set of compressed representations (compressed memories). The Compressive Transformer uses the same attention mechanism over its set of memories and compressed memories, learning to query both its short-term granular memory and longer-term coarse memory. It builds on the ideas of [Transformer-XL](https://paperswithcode.com/method/transformer-xl) which maintains a memory of past activations at each layer to preserve a longer history of context. The Transformer-XL discards past activations when they become sufficiently old (controlled by the size of the memory). The key principle of the Compressive Transformer is to compress these old memories, instead of discarding them, and store them in an additional [compressed memory](https://paperswithcode.com/method/compressed-memory).\r\n\r\nAt each time step $t$, we discard the oldest compressed memories (FIFO) and then the oldest $n$ states from ordinary memory are compressed and shifted to the new slot in compressed memory. During training, the compressive memory component is optimized separately from the main language model (separate training loop).",
  "title": "Compressive Transformers for Long-Range Sequence Modelling",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "All-Attention Layer",
  "full_name": "All-Attention Layer",
  "description": "An **All-Attention Layer** is an attention module and layer for transformers that merges the self-attention and feedforward sublayers into a single unified attention layer. As opposed to the two-step mechanism of the [Transformer](https://paperswithcode.com/method/transformer) layer, it directly builds its representation from the context and a persistent memory block without going through a feedforward transformation. The additional persistent memory block stores, in the form of key-value vectors, information that does not depend on the context. In terms of parameters, these persistent key-value vectors replace the feedforward sublayer.",
  "title": "Augmenting Self-attention with Persistent Memory",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "LFME",
  "full_name": "Learning From Multiple Experts",
  "description": "**Learning From Multiple Experts** is a self-paced knowledge distillation framework that aggregates the knowledge from multiple 'Experts' to learn a unified student model. Specifically, the proposed framework involves two levels of adaptive learning schedules: Self-paced Expert Selection and Curriculum Instance Selection, so that the knowledge is adaptively transferred to the 'Student'. The self-paced expert selection automatically controls the impact of knowledge distillation from each expert, so that the learned student model will gradually acquire the knowledge from the experts, and finally exceed the expert. The curriculum instance selection, on the other hand, designs a curriculum for the unified model where the training samples are organized from easy to hard, so that the unified student model will receive a less challenging learning schedule, and gradually learns from easy to hard samples.",
  "title": "Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-tailed Classification",
  "collection": "Knowledge Distillation",
  "area": "General"
}
{
  "name": "PGHI",
  "full_name": "Phase Gradient Heap Integration",
  "description": "Z. Průša, P. Balazs and P. L. Søndergaard, \"A Noniterative Method for Reconstruction of Phase From STFT Magnitude,\" in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 1154-1164, May 2017, doi: 10.1109/TASLP.2017.2678166.\r\nAbstract: A noniterative method for the reconstruction of the short-time fourier transform (STFT) phase from the magnitude is presented. The method is based on the direct relationship between the partial derivatives of the phase and the logarithm of the magnitude of the un-sampled STFT with respect to the Gaussian window. Although the theory holds in the continuous setting only, the experiments show that the algorithm performs well even in the discretized setting (discrete Gabor transform) with low redundancy using the sampled Gaussian window, the truncated Gaussian window and even other compactly supported windows such as the Hann window. Due to the noniterative nature, the algorithm is very fast and it is suitable for long audio signals. Moreover, solutions of iterative phase reconstruction algorithms can be improved considerably by initializing them with the phase estimate provided by the present algorithm. We present an extensive comparison with the state-of-the-art algorithms in a reproducible manner.\r\nURL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7890450&isnumber=7895265",
  "title": null,
  "collection": "Phase Reconstruction",
  "area": "Audio"
}
{
  "name": "Cross-resolution features",
  "full_name": "Cross-resolution features",
  "description": "",
  "title": "EfficientPose: Scalable single-person pose estimation",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "ENet Initial Block",
  "full_name": "ENet Initial Block",
  "description": "The **ENet Initial Block** is an image model block used in the [ENet](https://paperswithcode.com/method/enet) semantic segmentation architecture. [Max Pooling](https://paperswithcode.com/method/max-pooling) is performed with non-overlapping 2 × 2 windows, and the [convolution](https://paperswithcode.com/method/convolution) has 13 filters, which sums up to 16 feature maps after concatenation. This is heavily inspired by Inception Modules.",
  "title": "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "MDETR",
  "full_name": "MDETR",
  "description": "**MDETR** is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. It utilizes a [transformer](https://paperswithcode.com/method/transformer)-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. The network is pre-trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. The network is then fine-tuned on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation.",
  "title": "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "U-Net GAN",
  "full_name": "U-Net Generative Adversarial Network",
  "description": "In contrast to typical GANs, a U-Net GAN uses a segmentation network as the discriminator. This segmentation network predicts two classes: real and fake. In doing so, the discriminator gives the generator region-specific feedback. This discriminator design also enables a  [CutMix](https://paperswithcode.com/method/cutmix)-based consistency regularization on the two-dimensional output of the U-Net GAN discriminator, which further improves image synthesis quality.",
  "title": "A U-Net Based Discriminator for Generative Adversarial Networks",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "LeVIT",
  "full_name": "LeVIT",
  "description": "**LeVIT** is a hybrid neural network for fast inference image classification. LeViT is a stack of [transformer blocks](https://paperswithcode.com/method/transformer), with [pooling steps](https://paperswithcode.com/methods/category/pooling-operation) to reduce the resolution of the activation maps as in classical [convolutional architectures](https://paperswithcode.com/methods/category/convolutional-neural-networks). This replaces the uniform structure of a Transformer by a pyramid with pooling, similar to the [LeNet](https://paperswithcode.com/method/lenet) architecture",
  "title": "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Inpainting",
  "full_name": "Inpainting",
  "description": "Train a convolutional neural network to generate the contents of an arbitrary image region conditioned on its surroundings.",
  "title": "Context Encoders: Feature Learning by Inpainting",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Linformer",
  "full_name": "Linformer",
  "description": "**Linformer** is a linear [Transformer](https://paperswithcode.com/method/transformer) that utilises a linear self-attention mechanism to tackle the self-attention bottleneck with [Transformer models](https://paperswithcode.com/methods/category/transformers). The original [scaled dot-product attention](https://paperswithcode.com/method/scaled) is decomposed into multiple smaller attentions through linear projections, such that the combination of these operations forms a low-rank factorization of the original attention.",
  "title": "Linformer: Self-Attention with Linear Complexity",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Dorylus",
  "full_name": "Dorylus",
  "description": "**Dorylus** is a distributed system for training graph neural networks which uses cheap CPU servers and Lambda threads. It scales to\r\nlarge billion-edge graphs with low-cost cloud resources.",
  "title": "Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "MODERN",
  "full_name": "Modulated Residual Network",
  "description": "**MODERN**, or **Modulated Residual Network**, is an architecture for [visual question answering](https://paperswithcode.com/task/visual-question-answering) (VQA). It employs [conditional batch normalization](https://paperswithcode.com/method/conditional-batch-normalization) to allow a linguistic embedding from an [LSTM](https://paperswithcode.com/method/lstm) to modulate the [batch normalization](https://paperswithcode.com/method/batch-normalization) parameters of a [ResNet](https://paperswithcode.com/method/resnet). This enables the linguistic embedding to manipulate entire feature maps by scaling them up or down, negating them, or shutting them off, etc.",
  "title": "Modulating early visual processing by language",
  "collection": "VQA Models",
  "area": "Computer Vision"
}
{
  "name": "N-step Returns",
  "full_name": "N-step Returns",
  "description": "**$n$-step Returns** are used for value function estimation in reinforcement learning. Specifically, for $n$ steps we can write the complete return as:\r\n\r\n$$ R\\_{t}^{(n)} = r\\_{t+1} + \\gamma{r}\\_{t+2} + \\cdots + \\gamma^{n-1}\\_{t+n} + \\gamma^{n}V\\_{t}\\left(s\\_{t+n}\\right) $$\r\n\r\nWe can then write an $n$-step backup, in the style of TD learning, as:\r\n\r\n$$ \\Delta{V}\\_{r}\\left(s\\_{t}\\right) = \\alpha\\left[R\\_{t}^{(n)} - V\\_{t}\\left(s\\_{t}\\right)\\right] $$\r\n\r\nMulti-step returns often lead to faster learning with suitably tuned $n$.\r\n\r\nImage Credit: Sutton and Barto, Reinforcement Learning",
  "title": null,
  "collection": "Value Function Estimation",
  "area": "Reinforcement Learning"
}
{
  "name": "Point-GNN",
  "full_name": "Point-GNN",
  "description": "**Point-GNN** is a graph neural network for detecting objects from a LiDAR point cloud. It predicts the category and shape of the object that each vertex in the graph belongs to. In Point-GNN, there is an auto-registration mechanism to reduce translation variance, as well as a box merging and scoring operation to combine detections from multiple vertices accurately.",
  "title": "Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud",
  "collection": "3D Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "MEND",
  "full_name": "MODEL EDITOR NETWORKS WITH GRADIENT DECOMPOSITION",
  "description": "",
  "title": "Fast Model Editing at Scale",
  "collection": "Meta-Learning Algorithms",
  "area": "General"
}
{
  "name": "R(2+1)D",
  "full_name": "R(2+1)D",
  "description": "A **R(2+1)D** convolutional neural network is a network for action recognition that employs [R(2+1)D](https://paperswithcode.com/method/2-1-d-convolution) convolutions in a [ResNet](https://paperswithcode.com/method/resnet) inspired architecture. The use of these convolutions over regular [3D Convolutions](https://paperswithcode.com/method/3d-convolution) reduces computational complexity, prevents overfitting, and introduces more non-linearities that allow for a better functional relationship to be modeled.",
  "title": "A Closer Look at Spatiotemporal Convolutions for Action Recognition",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "NVAE Encoder Residual Cell",
  "full_name": "NVAE Encoder Residual Cell",
  "description": "The **NVAE Encoder Residual Cell** is a [residual connection](https://paperswithcode.com/method/residual-connection) block used in the [NVAE](https://paperswithcode.com/method/nvae) architecture for the encoder. It applies two series of BN-[Swish](https://paperswithcode.com/method/swish)-Conv layers without changing the number of channels.",
  "title": "NVAE: A Deep Hierarchical Variational Autoencoder",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Contrastive Multiview Coding",
  "full_name": "Contrastive Multiview Coding",
  "description": "**Contrastive Multiview Coding (CMC)** is a self-supervised learning approach, based on [CPC](https://paperswithcode.com/method/contrastive-predictive-coding), that  learns representations that capture information shared between multiple sensory views. The core idea is to set an anchor view and the sample positive and negative data points from the other view and maximise agreement between positive pairs in learning from two views. Contrastive learning is used to build the embedding.",
  "title": "Contrastive Multiview Coding",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "NT-Xent",
  "full_name": "Normalized Temperature-scaled Cross Entropy Loss",
  "description": "**NT-Xent**, or **Normalized Temperature-scaled Cross Entropy Loss**, is a loss function. Let $\\text{sim}\\left(\\mathbf{u}, \\mathbf{v}\\right) = \\mathbf{u}^{T}\\mathbf{v}/||\\mathbf{u}|| ||\\mathbf{v}||$ denote the cosine similarity between two vectors $\\mathbf{u}$ and $\\mathbf{v}$. Then the loss function for a positive pair of examples $\\left(i, j\\right)$ is :\r\n\r\n$$ \\mathbb{l}\\_{i,j} = -\\log\\frac{\\exp\\left(\\text{sim}\\left(\\mathbf{z}\\_{i}, \\mathbf{z}\\_{j}\\right)/\\tau\\right)}{\\sum^{2N}\\_{k=1}\\mathcal{1}\\_{[k\\neq{i}]}\\exp\\left(\\text{sim}\\left(\\mathbf{z}\\_{i}, \\mathbf{z}\\_{k}\\right)/\\tau\\right)}$$\r\n\r\nwhere $\\mathcal{1}\\_{[k\\neq{i}]} \\in ${$0, 1$} is an indicator function evaluating to $1$ iff $k\\neq{i}$ and $\\tau$ denotes a temperature parameter. The final loss is computed across all positive pairs, both $\\left(i, j\\right)$ and $\\left(j, i\\right)$, in a mini-batch.\r\n\r\nSource: [SimCLR](https://paperswithcode.com/method/simclr)",
  "title": "Improved Deep Metric Learning with Multi-class N-pair Loss Objective",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "3-Augment",
  "full_name": "3-Augment",
  "description": "",
  "title": "DeiT III: Revenge of the ViT",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "MatrixNet",
  "full_name": "MatrixNet",
  "description": "**MatrixNet** is a scale and aspect ratio aware building block for object detection that seek to handle objects of different sizes and aspect ratios. They have several matrix layers, each layer handles an object of specific size and aspect ratio. They can be seen as an alternative to [FPNs](https://paperswithcode.com/method/fpn). While FPNs are capable of handling objects of different sizes, they do not have a solution for objects of different aspect ratios. Objects such as a high tower, a giraffe, or a knife introduce a design difficulty for FPNs: does one map these objects to layers according to their width or height? Assigning the object to a layer according to its larger dimension would result in loss of information along the smaller dimension due to aggressive downsampling, and vice versa. \r\n\r\nMatrixNets assign objects of different sizes and aspect ratios to layers such that object sizes within their assigned layers are close to uniform. This assignment allows a square output [convolution](https://paperswithcode.com/method/convolution) kernel to equally gather information about objects of all aspect ratios and scales. MatrixNets can be applied to any backbone, similar to FPNs. We denote this by appending a \"-X\" to the backbone, i.e. ResNet50-X.",
  "title": "MatrixNets: A New Scale and Aspect Ratio Aware Architecture for Object Detection",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "DropPath",
  "full_name": "DropPath",
  "description": "Just as [dropout](https://paperswithcode.com/method/dropout) prevents co-adaptation of activations, **DropPath** prevents co-adaptation of parallel paths in networks such as [FractalNets](https://paperswithcode.com/method/fractalnet) by randomly dropping operands of the join layers. This\r\ndiscourages the network from using one input path as an anchor and another as a corrective term (a\r\nconfiguration that, if not prevented, is prone to overfitting). Two sampling strategies are:\r\n\r\n- **Local**: a join drops each input with fixed probability, but we make sure at least one survives.\r\n- **Global**: a single path is selected for the entire network. We restrict this path to be a single\r\ncolumn, thereby promoting individual columns as independently strong predictors.",
  "title": "FractalNet: Ultra-Deep Neural Networks without Residuals",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "CCT",
  "full_name": "Compact Convolutional Transformers",
  "description": "**Compact Convolutional Transformers** utilize sequence pooling and replace the patch embedding with a convolutional embedding, allowing for better inductive bias and making positional embeddings optional. CCT achieves better accuracy than ViT-Lite (smaller ViTs) and increases the flexibility of the input parameters.",
  "title": "Escaping the Big Data Paradigm with Compact Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "classifier-guidance",
  "full_name": "classifier-guidance",
  "description": "",
  "title": "Diffusion Models Beat GANs on Image Synthesis",
  "collection": "Image Generation Models",
  "area": "Computer Vision"
}
{
  "name": "Flow Alignment Module",
  "full_name": "Flow Alignment Module",
  "description": "**Flow Alignment Module**, or **FAM**, is a flow-based align module for scene parsing to learn Semantic Flow between feature maps of adjacent levels and broadcast high-level features to high resolution features effectively and efficiently. The concept of Semantic Flow is inspired from optical flow, which is widely used in video processing task to represent the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by relative motion. The authors postulate that the relatinship between two feature maps of arbitrary resolutions from the same image can also be represented with the “motion” of every pixel from one feature map to the other one. Once precise Semantic Flow is obtained, the network is able to propagate semantic features with minimal information loss.\r\n\r\nIn the FAM module, the transformed high-resolution feature map are combined with the low-resolution feature map to generate the semantic flow field, which is utilized to warp the low-resolution feature map to high-resolution feature map.",
  "title": "Semantic Flow for Fast and Accurate Scene Parsing",
  "collection": "Semantic Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "CharacterBERT",
  "full_name": "CharacterBERT",
  "description": "CharacterBERT is a variant of [BERT](https://paperswithcode.com/method/bert) that **drops the wordpiece system** and **replaces it with a CharacterCNN module** just like the one [ELMo](https://paperswithcode.com/method/elmo) uses to produce its first layer representation. This allows CharacterBERT to represent any input token without splitting it into wordpieces. Moreover, this frees BERT from the burden of a domain-specific wordpiece vocabulary which may not be suited to your domain of interest (e.g. medical domain). Finally, it allows the model to be more robust to noisy inputs.",
  "title": "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "DeepLabv2",
  "full_name": "DeepLabv2",
  "description": "**DeepLabv2** is an architecture for semantic segmentation that build on [DeepLab](https://paperswithcode.com/method/deeplab) with an atrous [spatial pyramid pooling](https://paperswithcode.com/method/spatial-pyramid-pooling) scheme. Here we have parallel dilated convolutions with different rates applied in the input feature map, which are then fused together. As objects of the same class can have different sizes in the image, [ASPP](https://paperswithcode.com/method/aspp) helps to account for different object sizes.",
  "title": "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Self-Calibrated Convolutions",
  "full_name": "Self-Calibrated Convolutions",
  "description": "Liu et al. presented self-calibrated convolution as a means to enlarge the receptive field at each spatial location. \r\n\r\nSelf-calibrated convolution is used together with a standard convolution. It first divides the input feature $X$ into $X_{1}$ and $X_{2}$ in the channel domain. The self-calibrated convolution first uses average pooling to reduce the input size and enlarge the receptive field:\r\n\\begin{align}\r\nT_{1} = AvgPool_{r}(X_{1}) \r\n\\end{align}\r\nwhere $r$ is the filter size and stride. Then a convolution is used to model the channel relationship and a bilinear interpolation operator $Up$ is used to upsample the feature map: \r\n\r\n\\begin{align}\r\nX'_{1} = \\text{Up}(Conv_2(T_1))\r\n\\end{align}\r\n\r\nNext, element-wise multiplication finishes the self-calibrated process:\r\n\r\n\\begin{align}\r\nY'_{1} = Conv_3(X_1) \\sigma(X_1 + X'_1)\r\n\\end{align}\r\n\r\nFinally, the output feature map of is formed:\r\n\\begin{align}\r\nY_{1} &= Conv_4(Y'_{1})\r\n\\end{align}\r\n\\begin{align}\r\nY_2 &= Conv_1(X_2)\r\n\\end{align}\r\n\\begin{align}\r\nY &= [Y_1; Y_2]\r\n\\end{align}\r\nSuch self-calibrated convolution can enlarge the receptive field of a network and improve its adaptability. It achieves excellent results in image classification and certain downstream tasks such as instance segmentation, object detection and keypoint detection.",
  "title": "Improving Convolutional Networks With Self-Calibrated Convolutions",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "TD Lambda",
  "full_name": "TD Lambda",
  "description": "**TD_INLINE_MATH_1** is a generalisation of **TD_INLINE_MATH_2** reinforcement learning algorithms, but it employs an [eligibility trace](https://paperswithcode.com/method/eligibility-trace) $\\lambda$ and $\\lambda$-weighted returns. The eligibility trace vector is initialized to zero at the beginning of the episode, and it is incremented on each time step by the value gradient, and then fades away by $\\gamma\\lambda$:\r\n\r\n$$ \\textbf{z}\\_{-1} = \\mathbf{0} $$\r\n$$ \\textbf{z}\\_{t} = \\gamma\\lambda\\textbf{z}\\_{t-1} + \\nabla\\hat{v}\\left(S\\_{t}, \\mathbf{w}\\_{t}\\right), 0 \\leq t \\leq T$$\r\n\r\nThe eligibility trace keeps track of which components of the weight vector contribute to recent state valuations. Here $\\nabla\\hat{v}\\left(S\\_{t}, \\mathbf{w}\\_{t}\\right)$ is the feature vector.\r\n\r\nThe TD error for state-value prediction is:\r\n\r\n$$ \\delta\\_{t} = R\\_{t+1} + \\gamma\\hat{v}\\left\\(S\\_{t+1}, \\mathbf{w}\\_{t}\\right) - \\hat{v}\\left(S\\_{t}, \\mathbf{w}\\_{t}\\right) $$\r\n\r\nIn **TD_INLINE_MATH_1**, the weight vector is updated on each step proportional to the scalar TD error and the vector eligibility trace:\r\n\r\n$$ \\mathbf{w}\\_{t+1} = \\mathbf{w}\\_{t} + \\alpha\\delta\\mathbf{z}\\_{t}  $$\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition",
  "title": null,
  "collection": "On-Policy TD Control",
  "area": "Reinforcement Learning"
}
{
  "name": "Deformable Convolution",
  "full_name": "Deformable Convolution",
  "description": "**Deformable convolutions** add 2D offsets to the regular grid sampling locations in the standard [convolution](https://paperswithcode.com/method/convolution). It enables free form deformation of the sampling grid. The offsets are learned from the preceding feature maps, via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.",
  "title": "Deformable Convolutional Networks",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "GAGNN",
  "full_name": "Group-Aware Neural Network",
  "description": "**GAGNN**, or **Group-aware Graph Neural Network**, is a hierarchical model for nationwide city air quality forecasting. The model constructs a city graph and a city group graph to model the spatial and latent dependencies between cities, respectively. GAGNN introduces differentiable grouping network to discover the latent dependencies among cities and generate city groups. Based on the generated city groups, a group correlation encoding module is introduced to learn the correlations between them, which can effectively capture the dependencies between city groups. After the graph construction, GAGNN implements message passing mechanism to model the dependencies between cities and city groups.",
  "title": "Group-Aware Graph Neural Network for Nationwide City Air Quality Forecasting",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "DeepIR",
  "full_name": "DeepIR",
  "description": "**DeepIR**, or **Deep InfraRed image processing**, is a thermal image processing framework for recovering high quality images from a very small set of images captured with camera motion. Enhancement is achieved by noting that camera motion, which is usually a hinderance, can be exploited to our advantage to separate a sequence of images into the scene-dependent radiant flux, and a slowly changing scene-independent non-uniformity. DeepIR combines the physics of microbolometer sensors, with powerful regularization capabilities by neural network-based representations. DeepIR relies on the key observation that jittering a camera, while unwanted in visible domain, is highly desirable in the thermal domain as it allows an accurate separation of the sensor-specific non-uniformities from the scene’s radiant flux.",
  "title": "Thermal Image Processing via Physics-Inspired Deep Networks",
  "collection": "Thermal Image Processing Models",
  "area": "Computer Vision"
}
{
  "name": "Demon",
  "full_name": "Demon",
  "description": "**Decaying Momentum**, or **Demon**, is a stochastic optimizer motivated by decaying the total contribution of a gradient to all future updates. By decaying the momentum parameter, the total contribution of a gradient to all future updates is decayed. A particular gradient term $g\\_{t}$ contributes a total of  $\\eta\\sum\\_{i}\\beta^{i}$ of its \"energy\" to all future gradient updates, and this results in the geometric sum, $\\sum^{\\infty}\\_{i=1}\\beta^{i} = \\beta\\sum^{\\infty}\\_{i=0}\\beta^{i} = \\frac{\\beta}{\\left(1-\\beta\\right)}$. Decaying this sum results in the Demon algorithm. Letting $\\beta\\_{init}$ be the initial $\\beta$; then at the current step $t$ with total $T$ steps, the decay routine is given by solving the below for $\\beta\\_{t}$:\r\n\r\n$$ \\frac{\\beta\\_{t}}{\\left(1-\\beta\\_{t}\\right)} =  \\left(1-t/T\\right)\\beta\\_{init}/\\left(1-\\beta\\_{init}\\right)$$\r\n\r\nWhere $\\left(1-t/T\\right)$ refers to the proportion of iterations remaining. Note that Demon typically requires no hyperparameter tuning as it is usually decayed to $0$ or a small negative value at time \r\n$T$. Improved performance is observed by delaying the decaying. Demon can be applied to any gradient descent algorithm with a momentum parameter.",
  "title": "Demon: Improved Neural Network Training with Momentum Decay",
  "collection": "Momentum Rules",
  "area": "General"
}
{
  "name": "Aggregated Learning",
  "full_name": "Aggregated Learning",
  "description": "**Aggregated Learning (AgrLearn)** is a vector-quantization approach to learning neural network classifiers. It builds on an equivalence between IB learning and IB quantization and exploits the power of vector quantization, which is well known in information theory.",
  "title": "Aggregated Learning: A Vector-Quantization Approach to Learning Neural Network Classifiers",
  "collection": "Information Bottleneck",
  "area": "General"
}
{
  "name": "PISA",
  "full_name": "PrIme Sample Attention",
  "description": "**PrIme Sample Attention (PISA)** directs the training of object detection frameworks towards prime samples. These are samples that play a key role in driving the detection performance. The authors define Hierarchical Local Rank (HLR) as a metric of importance. Specifically, they use IoU-HLR to rank positive samples and ScoreHLR to rank negative samples in each mini-batch. This ranking strategy places the positive samples with highest IoUs around each object and the negative samples with highest scores in each cluster to the top of the ranked list and directs the focus of the training process to them via a simple re-weighting scheme. The authors also devise a classification-aware regression loss to jointly optimize the classification and regression branches. Particularly, this loss would suppress those samples with large regression loss, thus reinforcing the attention to prime samples.",
  "title": "Prime Sample Attention in Object Detection",
  "collection": "Prioritized Sampling",
  "area": "General"
}
{
  "name": "Grouped Convolution",
  "full_name": "Grouped Convolution",
  "description": "A **Grouped Convolution** uses a group of convolutions - multiple kernels per layer - resulting in multiple channel outputs per layer. This leads to wider networks helping a network learn a varied set of low level and high level features. The original motivation of using Grouped Convolutions in [AlexNet](https://paperswithcode.com/method/alexnet) was to distribute the model over multiple GPUs as an engineering compromise. But later, with models such as [ResNeXt](https://paperswithcode.com/method/resnext), it was shown this module could be used to improve classification accuracy. Specifically by exposing a new dimension through grouped convolutions, *cardinality* (the size of set of transformations), we can increase accuracy by increasing it.",
  "title": "ImageNet Classification with Deep Convolutional Neural Networks",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Deep Voice 3",
  "full_name": "Deep Voice 3",
  "description": "**Deep Voice 3 (DV3)** is a fully-convolutional attention-based neural text-to-speech system. The Deep Voice 3 architecture consists of three components:\r\n\r\n- Encoder: A fully-convolutional encoder, which converts textual features to an internal\r\nlearned representation.\r\n\r\n- Decoder: A fully-convolutional causal decoder, which decodes the learned representation\r\nwith a multi-hop convolutional attention mechanism into a low-dimensional audio representation (mel-scale spectrograms) in an autoregressive manner.\r\n\r\n- Converter: A fully-convolutional post-processing network, which predicts final vocoder\r\nparameters (depending on the vocoder choice) from the decoder hidden states. Unlike the\r\ndecoder, the converter is non-causal and can thus depend on future context information.\r\n\r\nThe overall objective function to be optimized is a linear combination of the losses from the decoder and the converter. The authors separate decoder and converter and apply multi-task training, because it makes attention learning easier in practice. To be specific, the loss for mel-spectrogram prediction guides training of the attention mechanism, because the attention is trained with the gradients from mel-spectrogram prediction besides vocoder parameter prediction.",
  "title": "Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "CLRNet",
  "full_name": "Convolutional LSTM based Residual Network",
  "description": "",
  "title": "A Convolutional LSTM based Residual Network for Deepfake Video Detection",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "STMDA-RetinaNet",
  "full_name": "Self training multi target domain adaptive RetinaNet",
  "description": "",
  "title": "A Multi Camera Unsupervised Domain Adaptation Pipeline for Object Detection in Cultural Sites through Adversarial Learning and Self-Training",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "Universal Probing",
  "full_name": "Massively multilingual probing based on Universal Dependencies",
  "description": "",
  "title": "Universal and Independent: Multilingual Probing Framework for Exhaustive Model Interpretation and Evaluation",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "PCB",
  "full_name": "Part-based Convolutional Baseline",
  "description": "",
  "title": "Beyond Part Models: Person Retrieval with Refined Part Pooling (and a Strong Convolutional Baseline)",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "BigGAN",
  "full_name": "BigGAN",
  "description": "**BigGAN** is a type of generative adversarial network that was designed for scaling generation to high-resolution, high-fidelity images. It includes a number of incremental changes and innovations. The baseline and incremental changes are:\r\n\r\n- Using [SAGAN](https://paperswithcode.com/method/sagan) as a baseline with spectral norm. for G and D, and using [TTUR](https://paperswithcode.com/method/ttur).\r\n- Using a Hinge Loss [GAN](https://paperswithcode.com/method/gan) objective\r\n- Using class-[conditional batch normalization](https://paperswithcode.com/method/conditional-batch-normalization) to provide class information to G (but with linear projection not MLP.\r\n- Using a [projection discriminator](https://paperswithcode.com/method/projection-discriminator) for D to provide class information to D.\r\n- Evaluating with EWMA of G's weights, similar to ProGANs.\r\n\r\nThe innovations are:\r\n\r\n- Increasing batch sizes, which has a big effect on the Inception Score of the model.\r\n- Increasing the width in each layer leads to a further Inception Score improvement.\r\n- Adding skip connections from the latent variable $z$ to further layers helps performance.\r\n- A new variant of [Orthogonal Regularization](https://paperswithcode.com/method/orthogonal-regularization).",
  "title": "Large Scale GAN Training for High Fidelity Natural Image Synthesis",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "ENet Dilated Bottleneck",
  "full_name": "ENet Dilated Bottleneck",
  "description": "**ENet Dilated Bottleneck** is an image model block used in the [ENet](https://paperswithcode.com/method/enet) semantic segmentation architecture. It is the same as a regular [ENet Bottleneck](https://paperswithcode.com/method/enet-bottleneck) but employs dilated convolutions instead.",
  "title": "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Adaptive Input Representations",
  "full_name": "Adaptive Input Representations",
  "description": "**Adaptive Input Embeddings** extend the [adaptive softmax](https://paperswithcode.com/method/adaptive-softmax) to input word representations. The factorization assigns more capacity to frequent words and reduces the capacity for less frequent words with the benefit of reducing overfitting to rare words.",
  "title": "Adaptive Input Representations for Neural Language Modeling",
  "collection": "Input Embedding Factorization",
  "area": "Natural Language Processing"
}
{
  "name": "STraTA",
  "full_name": "Self-Training with Task Augmentation",
  "description": "**STraTA**, or **Self-Training with Task Augmentation**, is a self-training approach that builds on two key ideas for effective leverage of unlabeled data. First, STraTA uses task augmentation, a technique that synthesizes a large amount of data for auxiliary-task fine-tuning from target-task unlabeling texts. Second, STRATA performs self-training by further fine-tuning the strong base model created by task augmentation on a broad distribution of pseudo-labeled data.\r\n\r\nIn task augmentation, we train an NLI data generation model and use it to synthesize a large amount of in-domain NLI training data for each given target task, which is then used for auxiliary (intermediate) fine-tuning. The self-training algorithm iteratively learns a better model using a concatenation of labeled and pseudo-labeled examples. At each iteration, we always start with the auxiliary-task model produced by task augmentation and train on a broad distribution of pseudo-labeled data.",
  "title": "STraTA: Self-Training with Task Augmentation for Better Few-shot Learning",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "FairMOT",
  "full_name": "FairMOT",
  "description": "**FairMOT** is a model for multi-object tracking which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features. The achieved fairness between the tasks is used to achieve high levels of detection and tracking accuracy. The detection branch is implemented in an anchor-free style which estimates object centers and sizes represented as position-aware measurement maps. Similarly, the re-ID branch estimates a re-ID feature for each pixel to characterize the object centered at the pixel. Note that the two branches are completely homogeneous which essentially differs from the previous methods which perform detection and re-ID in a cascaded style. It is also worth noting that FairMOT operates on high-resolution feature maps of strides four while the previous anchor-based methods operate on feature maps of stride 32. The elimination of anchors as well as the use of high-resolution feature maps better aligns re-ID features to object centers which significantly improves the tracking accuracy.",
  "title": "FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking",
  "collection": "Multi-Object Tracking Models",
  "area": "Computer Vision"
}
{
  "name": "OSA (identity mapping + eSE)",
  "full_name": "OSA (identity mapping + eSE)",
  "description": "**One-Shot Aggregation with an Identity Mapping and eSE** is an image model block that extends [one-shot aggregation](https://paperswithcode.com/method/one-shot-aggregation) with a [residual connection](https://paperswithcode.com/method/residual-connection) and [effective squeeze-and-excitation block](https://paperswithcode.com/method/effective-squeeze-and-excitation-block). It is proposed as part of the [VoVNetV2](https://paperswithcode.com/method/vovnetv2) CNN architecture.\r\n\r\nThe module adds an identity mapping to the OSA module - the input path is connected to the end of an OSA module that is able to backpropagate the gradients of every OSA module in an end-to-end manner on each stage like a [ResNet](https://paperswithcode.com/method/resnet). Additionally, a [channel attention module](https://paperswithcode.com/method/channel-attention-module) - effective Squeeze-Excitation - is used which is like regular [squeeze-and-excitation](https://paperswithcode.com/method/squeeze-and-excitation-block) but uses only one FC layer with $C$ channels instead of two FCs without a channel dimension reduction, which maintains channel information.",
  "title": "CenterMask : Real-Time Anchor-Free Instance Segmentation",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "FFF",
  "full_name": "Fast Feedforward Networks",
  "description": "A log-time alternative to feedforward layers outperforming both the vanilla feedforward and mixture-of-experts approaches.",
  "title": "Fast Feedforward Networks",
  "collection": "Backbone Architectures",
  "area": "Computer Vision"
}
{
  "name": "Deformable Kernel",
  "full_name": "Deformable Kernel",
  "description": "A **Deformable Kernels** is a type of convolutional operator for deformation modeling. DKs learn free-form offsets on kernel coordinates to deform the original kernel space towards specific data modality, rather than recomposing data. This can directly adapt the effective receptive field (ERF) while leaving the receptive field untouched. They can be used as a drop-in replacement of rigid kernels. \r\n\r\nAs shown in the Figure, for each input patch, a local DK first generates a group of kernel offsets $\\{\\Delta \\mathcal{k}\\}$ from input feature patch using the light-weight generator $\\mathcal{G}$ (a 3$\\times$3 [convolution](https://paperswithcode.com/method/convolution) of rigid kernel). Given the original kernel weights $\\mathcal{W}$ and the offset group $\\{\\Delta \\mathcal{k}\\}$, DK samples a new set of kernel $\\mathcal{W}'$ using a bilinear sampler $\\mathcal{B}$. Finally, DK convolves the input feature map and the sampled kernels to complete the whole computation.",
  "title": "Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "GaAN",
  "full_name": "Gated Attention Networks",
  "description": "Gated Attention Networks (GaAN) is a new architecture for learning on graphs. Unlike the traditional multi-head attention mechanism, which equally consumes all attention heads, GaAN uses a convolutional sub-network to control each attention head’s importance.\r\n\r\nImage credit: [GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs](https://paperswithcode.com/paper/gaan-gated-attention-networks-for-learning-on)",
  "title": "GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "S-GCN",
  "full_name": "Spherical Graph Convolutional Network",
  "description": "",
  "title": "Spherical convolutions on molecular graphs for protein model quality assessment",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Ghost Bottleneck",
  "full_name": "Ghost Bottleneck",
  "description": "A **Ghost BottleNeck** is a skip connection block, similar to the basic [residual block](https://paperswithcode.com/method/residual-block) in [ResNet](https://paperswithcode.com/method/resnet) in which several convolutional layers and shortcuts are integrated, but stacks [Ghost Modules](https://paperswithcode.com/method/ghost-module) instead (two stacked Ghost modules). It was proposed as part of the [GhostNet](https://paperswithcode.com/method/ghostnet) CNN architecture.\r\n\r\nThe first Ghost module acts as an expansion layer increasing the number of channels. The ratio between the number of the output channels and that of the input is referred to as the *expansion ratio*. The second Ghost module reduces the number of channels to match the shortcut path. Then the shortcut is connected between the inputs and the outputs of these two Ghost modules. The [batch normalization](https://paperswithcode.com/method/batch-normalization) (BN) and [ReLU](https://paperswithcode.com/method/relu) nonlinearity are applied after each layer, except that ReLU is not used after the second Ghost module as suggested by [MobileNetV2](https://paperswithcode.com/method/mobilenetv2). The Ghost bottleneck described above is for stride=1. As for the case where stride=2, the shortcut path is implemented by a downsampling layer and a [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution) with stride=2 is inserted between the two Ghost modules. In practice, the primary [convolution](https://paperswithcode.com/method/convolution) in Ghost module here is [pointwise convolution](https://paperswithcode.com/method/pointwise-convolution) for its efficiency.",
  "title": "GhostNet: More Features from Cheap Operations",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "Residual Block",
  "full_name": "Residual Block",
  "description": "**Residual Blocks** are skip-connection blocks that learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. They were introduced as part of the [ResNet](https://paperswithcode.com/method/resnet) architecture.\r\n \r\nFormally, denoting the desired underlying mapping as $\\mathcal{H}({x})$, we let the stacked nonlinear layers fit another mapping of $\\mathcal{F}({x}):=\\mathcal{H}({x})-{x}$. The original mapping is recast into $\\mathcal{F}({x})+{x}$. The $\\mathcal{F}({x})$ acts like a residual, hence the name 'residual block'.\r\n\r\nThe intuition is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers. Having skip connections allows the network to more easily learn identity-like mappings.\r\n\r\nNote that in practice, [Bottleneck Residual Blocks](https://paperswithcode.com/method/bottleneck-residual-block) are used for deeper ResNets, such as ResNet-50 and ResNet-101, as these bottleneck blocks are less computationally intensive.",
  "title": "Deep Residual Learning for Image Recognition",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "TridentNet Block",
  "full_name": "TridentNet Block",
  "description": "A **TridentNet Block** is a feature extractor used in object detection models. Instead of feeding in multi-scale inputs like the image pyramid, in a [TridentNet](https://paperswithcode.com/method/tridentnet) block we adapt the backbone network for different scales. These blocks create multiple scale-specific feature maps. With the help of dilated convolutions, different branches of trident blocks have the same network structure and share the\r\nsame parameters yet have different receptive fields. Furthermore, to avoid training objects with extreme scales, a scale-aware training scheme is employed to make each branch specific to a given scale range matching its receptive field. Weight sharing is used to prevent overfitting.",
  "title": "Scale-Aware Trident Networks for Object Detection",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Local Augmentation",
  "full_name": "Local Augmentation",
  "description": "**Local Augmentation for Graph Neural Networks**, or **LA-GNN**, is a data augmentation technique that enhances node features by its local subgraph structures. Specifically, it learns the conditional distribution of the connected neighbors’ representations given the representation of the central node, which has an analogy with the [Skip-gram of word2vec](https://paperswithcode.com/method/skip-gram-word2vec) model that predicts the probability of the context given the central word. After augmenting the neighborhood, we concat the initial and the generated feature matrix as input for GNNs.",
  "title": "Local Augmentation for Graph Neural Networks",
  "collection": "Graph Data Augmentation",
  "area": "Graphs"
}
{
  "name": "Adaptive Masking",
  "full_name": "Adaptive Masking",
  "description": "**Adaptive Masking** is a type of attention mechanism that allows a model to learn its own context size to attend over. For each head in [Multi-Head Attention](https://paperswithcode.com/method/multi-head-attention), a masking function is added to control for the span of the attention. A masking function is a non-increasing function that maps a\r\ndistance to a value in $\\left[0, 1\\right]$. Adaptive masking takes the following soft masking function $m\\_{z}$ parametrized by a real value $z$ in $\\left[0, S\\right]$:\r\n\r\n$$ m\\_{z}\\left(x\\right) = \\min\\left[\\max\\left[\\frac{1}{R}\\left(R+z-x\\right), 0\\right], 1\\right] $$\r\n\r\nwhere $R$ is a hyper-parameter that controls its softness. The shape of this piecewise function as a function of the distance. This soft masking function is inspired by [Jernite et al. (2017)](https://arxiv.org/abs/1611.06188). The attention weights from are then computed on the masked span:\r\n\r\n$$ a\\_{tr} = \\frac{m\\_{z}\\left(t-r\\right)\\exp\\left(s\\_{tr}\\right)}{\\sum^{t-1}\\_{q=t-S}m\\_{z}\\left(t-q\\right)\\exp\\left(s\\_{tq}\\right)}$$\r\n\r\nA $\\mathcal{l}\\_{1}$ penalization is added on the parameters $z\\_{i}$ for each attention head $i$ of the model to the loss function:\r\n\r\n$$ L = - \\log{P}\\left(w\\_{1}, \\dots, w\\_{T}\\right) + \\frac{\\lambda}{M}\\sum\\_{i}z\\_{i} $$\r\n\r\nwhere $\\lambda > 0$ is the regularization hyperparameter, and $M$ is the number of heads in each\r\nlayer. This formulation is differentiable in the parameters $z\\_{i}$, and learnt jointly with the rest of the model.",
  "title": "Adaptive Attention Span in Transformers",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ESPNetv2",
  "full_name": "ESPNetv2",
  "description": "**ESPNetv2** is a convolutional neural network that utilises group point-wise and depth-wise dilated separable convolutions to learn representations from a large effective receptive field with fewer FLOPs and parameters.",
  "title": "ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "G-GLN Neuron",
  "full_name": "G-GLN Neuron",
  "description": "A **G-GLN Neuron** is a type of neuron used in the [G-GLN](https://paperswithcode.com/method/g-gln) architecture. G-GLN. The key idea is that further representational power can be added to a weighted product of Gaussians via a contextual gating procedure. This is achieved by extending a weighted product of Gaussians model with an additional type of input called side information. The side information will be used by a neuron to select a weight vector to apply for a given example from a table of weight vectors. In typical applications to regression, the side information is defined as the (normalized) input features for an input example: i.e. $z=(x-\\bar{x}) / \\sigma\\_{x}$.\r\n\r\nMore formally, associated with each neuron is a context function $c: \\mathcal{Z} \\rightarrow \\mathcal{C}$, where $\\mathcal{Z}$ is the set of possible side information and $\\mathcal{C}=\\{0, \\ldots, k-1\\}$ for some $k \\in \\mathbb{N}$ is the context space. Each neuron $i$ is now parameterized by a weight matrix $W\\_{i}=\\left[w\\_{i, 0} \\ldots w\\_{i, k-1}\\right]^{\\top}$ with each row vector $w\\_{i j} \\in \\mathcal{W}$ for $0 \\leq j<k$. The context function $c$ is responsible for mapping side information $z \\in \\mathcal{Z}$ to a particular row $w\\_{i, c(z)}$ of $W_{i}$, which we then use to weight the Product of Gaussians. In other words, a G-GLN neuron can be defined by:\r\n\r\n$$\r\n\\operatorname{PoG}\\_{W}^{c}\\left(y ; f_{1}(\\cdot), \\ldots, f\\_{m}(\\cdot), z\\right):=\\operatorname{PoG}\\_{w^{c(z)}}\\left(y ; f\\_{1}(\\cdot), \\ldots, f\\_{m}(\\cdot)\\right)\r\n$$\r\n\r\nwith the associated loss function $-\\log \\left(\\operatorname{PoG}\\_{W}^{c}\\left(y ; f\\_{1}(y), \\ldots, f\\_{m}(y), z\\right)\\right)$ inheriting all the properties needed to apply Online Convex Programming.",
  "title": "Gaussian Gated Linear Networks",
  "collection": "Gated Linear Networks",
  "area": "General"
}
{
  "name": "Big-Little Module",
  "full_name": "Big-Little Module",
  "description": "**Big-Little Modules** are blocks for image models that have two branches: each of which represents a separate block from a deep model and a less deep counterpart. They were proposed as part of the [BigLittle-Net](https://paperswithcode.com/method/big-little-net) architecture. The two branches are fused with a linear combination and unit weights. These two branches are known as Big-Branch (more layers and channels at low resolutions) and Little-Branch (fewer layers and channels at high resolution).",
  "title": "Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "StoGCN",
  "full_name": "StoGCN",
  "description": "StoGCN is a control variate based algorithm which allow sampling an arbitrarily small neighbor size. Presents new theoretical guarantee for the algorithms to converge to a local optimum of GCN.",
  "title": "Stochastic Training of Graph Convolutional Networks with Variance Reduction",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "GCT",
  "full_name": "Gated Channel Transformation",
  "description": "GCT first collects global information by computing the l2-norm of each channel. Next, a learnable vector $ \\alpha $ is applied to scale the feature. Then a competition mechanism is adopted by channel normalization to interact between channels.\r\n\r\nUnlike previous methods, GCT first collects global information by computing the $l_{2}$-norm of each channel. \r\nNext, a learnable vector $\\alpha$ is applied to scale the feature.\r\nThen a competition mechanism is adopted by \r\nchannel normalization to interact between channels. \r\nLike other common normalization methods, \r\na learnable scale parameter $\\gamma$ and bias $\\beta$ are applied to \r\nrescale the normalization.\r\nHowever, unlike previous methods,\r\nGCT adopts tanh activation to control the attention vector.\r\nFinally, it not only multiplies the input by the attention vector but also adds an identity connection. GCT can be written as: \r\n\\begin{align}\r\n    s = F_\\text{gct}(X, \\theta) & = \\tanh (\\gamma CN(\\alpha \\text{Norm}(X)) + \\beta)\r\n\\end{align}\r\n\\begin{align}\r\n    Y & = s  X + X\r\n\\end{align}\r\n\r\nwhere $\\alpha$, $\\beta$ and $\\gamma$ are trainable parameters. $\\text{Norm}(\\cdot)$ indicates the $L2$-norm of each channel. $CN$ is  channel normalization.\r\n\r\nA GCT block has fewer parameters than an SE block, and as it is  lightweight, \r\n can be added after each convolutional layer of a CNN.",
  "title": "Gated Channel Transformation for Visual Recognition",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "M2Det",
  "full_name": "M2Det",
  "description": "**M2Det** is a one-stage object detection model that utilises a Multi-Level Feature Pyramid Network ([MLFPN](https://paperswithcode.com/method/mlfpn)) to extract features from the input image, and then similar to [SSD](https://paperswithcode.com/method/ssd), produces dense bounding boxes and category scores based on the learned features, followed by the non-maximum suppression (NMS) operation to produce the final results.",
  "title": "M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "QRNN",
  "full_name": "Quasi-Recurrent Neural Network",
  "description": "A **QRNN**, or **Quasi-Recurrent Neural Network**, is a type of recurrent neural network that alternates convolutional layers, which apply in parallel across timesteps, and a minimalist recurrent pooling function that applies in parallel across channels. Due to their increased parallelism, they can be up to 16 times faster at train and test time than [LSTMs](https://paperswithcode.com/method/lstm).\r\n\r\nGiven an input sequence $\\mathbf{X} \\in \\mathbb{R}^{T\\times{n}}$ of $T$ n-dimensional vectors $\\mathbf{x}\\_{1}, \\dots, \\mathbf{x}\\_{T}$, the convolutional subcomponent of a QRNN performs convolutions in the timestep dimension with a bank of $m$ filters, producing a sequence $\\mathbf{Z} \\in \\mathbb{R}^{T\\times{m}}$ of m-dimensional candidate vectors $\\mathbf{z}\\_{t}$. Masked convolutions are used so filters can not access information from future timesteps (implementing with left padding).\r\n\r\nAdditional convolutions are applied with separate filter banks to obtain sequences of vectors for the\r\nelementwise gates that are needed for the pooling function. While the candidate vectors are passed\r\nthrough a $\\tanh$ nonlinearity, the gates use an elementwise sigmoid. If the pooling function requires a\r\nforget gate $f\\_{t}$ and an output gate $o\\_{t}$ at each timestep, the full set of computations in the convolutional component is then:\r\n\r\n$$ \\mathbf{Z} = \\tanh\\left(\\mathbf{W}\\_{z} ∗ \\mathbf{X}\\right) $$\r\n$$ \\mathbf{F} = \\sigma\\left(\\mathbf{W}\\_{f} ∗ \\mathbf{X}\\right) $$\r\n$$ \\mathbf{O} = \\sigma\\left(\\mathbf{W}\\_{o} ∗ \\mathbf{X}\\right) $$\r\n\r\nwhere $\\mathbf{W}\\_{z}$, $\\mathbf{W}\\_{f}$, and $\\mathbf{W}\\_{o}$, each in $\\mathbb{R}^{k×n×m}$, are the convolutional filter banks and ∗ denotes a [masked convolution](https://paperswithcode.com/method/masked-convolution) along the timestep dimension.  
Dynamic [average pooling](https://paperswithcode.com/method/average-pooling) by Balduzzi & Ghifary (2016) is used, which uses only a forget gate:\r\n\r\n$$ \\mathbf{h}\\_{t} = \\mathbf{f}\\_{t} \\odot{\\mathbf{h}\\_{t−1}} + \\left(1 − \\mathbf{f}\\_{t}\\right) \\odot{\\mathbf{z}\\_{t}} $$ \r\n\r\nThis is denoted f-pooling. The function may also include an output gate:\r\n\r\n$$ \\mathbf{c}\\_{t} = \\mathbf{f}\\_{t} \\odot{\\mathbf{c}\\_{t−1}} + \\left(1 − \\mathbf{f}\\_{t}\\right) \\odot{\\mathbf{z}\\_{t}} $$ \r\n\r\n$$ \\mathbf{h}\\_{t} = \\mathbf{o}\\_{t} \\odot{\\mathbf{c}\\_{t}} $$\r\n\r\nThis is denoted fo-pooling. Alternatively, the recurrence relation may include independent input and forget gates:\r\n\r\n$$ \\mathbf{c}\\_{t} = \\mathbf{f}\\_{t} \\odot{\\mathbf{c}\\_{t−1}} + \\mathbf{i}\\_{t}\\odot{\\mathbf{z}\\_{t}} $$ \r\n\r\n$$ \\mathbf{h}\\_{t} = \\mathbf{o}\\_{t} \\odot{\\mathbf{c}\\_{t}} $$\r\n\r\nThis is denoted ifo-pooling. In each case $h$ or $c$ is initialized to zero. The recurrent parts of these functions must be calculated for each timestep in the sequence, but parallelism along feature dimensions means evaluating them even over long sequences requires a negligible amount of computation time.\r\n\r\nA single QRNN layer thus performs an input-dependent pooling, followed by a gated linear combination of convolutional features. As with convolutional neural networks, two or more QRNN layers should be stacked to create a model with the capacity to approximate more complex functions.",
  "title": "Quasi-Recurrent Neural Networks",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "DiCENet",
  "full_name": "DiCENet",
  "description": "**DiCENet** is a convolutional neural network architecture that utilizes dimensional convolutions (and dimension-wise fusion). The dimension-wise convolutions apply light-weight convolutional filtering across each dimension of the input tensor while dimension-wise fusion efficiently combines these dimension-wise representations; allowing the [DiCE Unit](https://paperswithcode.com/method/dice-unit) in the network to efficiently encode spatial and channel-wise information contained in the input tensor.",
  "title": "DiCENet: Dimension-wise Convolutions for Efficient Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DeepSIM",
  "full_name": "DeepSIM",
  "description": "**DeepSIM** is a generative model for conditional image manipulation based on a single image. The network learns to map between a primitive representation of the image to the image itself. At manipulation time, the generator allows for making complex image changes by modifying the primitive input representation and mapping it through the network. The choice of a primitive representations has an impact on the ease and expressiveness of the manipulations and can be automatic (e.g. edges), manual, or hybrid such as edges on top of segmentations.",
  "title": "Image Shape Manipulation from a Single Augmented Training Sample",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "Local Relation Layer",
  "full_name": "Local Relation Layer",
  "description": "A **Local Relation Layer** is an image feature extractor that is an alternative to a [convolution](https://paperswithcode.com/method/convolution) operator. The intuition is that aggregation in convolution is basically a pattern matching process that applies fixed filters, which can be inefficient at modeling visual elements with varying spatial distributions. The local relation layer adaptively determines aggregation weights based on the compositional relationship of local pixel pairs. It is argued that, with this relational approach, it can composite visual elements into higher-level entities in a more efficient manner that benefits semantic inference.",
  "title": "Local Relation Networks for Image Recognition",
  "collection": "Image Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "DeepWalk",
  "full_name": "DeepWalk",
  "description": "**DeepWalk** learns embeddings (social representations) of a graph's vertices, by modeling a stream of short random walks. Social representations are latent features of the vertices that capture neighborhood similarity and community membership. These latent representations encode social relations in a continuous vector space with a relatively small number of dimensions. It generalizes neural language models to process a special language composed of a set of randomly-generated walks. \r\n\r\nThe goal is to learn a latent representation, not only a probability distribution of node co-occurrences, and so as to introduce a mapping function $\\Phi \\colon v \\in V \\mapsto \\mathbb{R}^{|V|\\times d}$.\r\nThis mapping $\\Phi$ represents the latent social representation associated with each vertex $v$ in the graph. In practice, $\\Phi$ is represented by a $|V| \\times d$ matrix of free parameters.",
  "title": "DeepWalk: Online Learning of Social Representations",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "BiGRU",
  "full_name": "Bidirectional GRU",
  "description": "A **Bidirectional GRU**, or **BiGRU**, is a sequence processing model that consists of two [GRUs](https://paperswithcode.com/method/gru). one taking the input in a forward direction, and the other in a backwards direction. It is a bidirectional recurrent neural network with only the input and forget gates.\r\n\r\nImage Source: *Rana R (2016). Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech.*",
  "title": null,
  "collection": "Bidirectional Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Monte-Carlo Tree Search",
  "full_name": "Monte-Carlo Tree Search",
  "description": "**Monte-Carlo Tree Search** is a planning algorithm that accumulates value estimates obtained from Monte Carlo simulations in order to successively direct simulations towards more highly-rewarded trajectories. We execute MCTS after encountering each new state to select an agent's action for that state: it is executed again to select the action for the next state. Each execution is an iterative process that simulates many trajectories starting from the current state to the terminal state. The core idea is to successively focus multiple simulations starting at the current state by extending the initial portions of trajectories that have received high evaluations from earlier simulations.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning (2nd Edition)\r\n\r\nImage Credit: [Chaslot et al](https://www.aaai.org/Papers/AIIDE/2008/AIIDE08-036.pdf)",
  "title": null,
  "collection": "Heuristic Search Algorithms",
  "area": "Reinforcement Learning"
}
{
  "name": "SNet",
  "full_name": "SNet",
  "description": "**SNet** is a convolutional neural network architecture and object detection backbone used for the [ThunderNet](https://paperswithcode.com/method/thundernet) two-stage object detector. SNet uses ShuffleNetV2 basic blocks but replaces all 3×3 depthwise convolutions with 5×5 depthwise convolutions.",
  "title": "ThunderNet: Towards Real-time Generic Object Detection",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "VGG",
  "full_name": "VGG",
  "description": "**VGG** is a classical convolutional neural network architecture. It was based on an analysis of how to increase the depth of such networks. The network utilises small 3 x 3 filters. Otherwise the network is characterized by its simplicity: the only other components being pooling layers and a fully connected layer.\r\n\r\nImage: [Davi Frossard](https://www.cs.toronto.edu/frossard/post/vgg16/)",
  "title": "Very Deep Convolutional Networks for Large-Scale Image Recognition",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "PointNet",
  "full_name": "PointNet",
  "description": "**PointNet** provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. It directly takes point clouds as input and outputs either class labels for the entire input or per point segment/part labels for each point of the input.\r\n\r\nSource: [Qi et al.](https://arxiv.org/pdf/1612.00593v2.pdf)\r\n\r\nImage source: [Qi et al.](https://arxiv.org/pdf/1612.00593v2.pdf)",
  "title": "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation",
  "collection": "3D Representations",
  "area": "Computer Vision"
}
{
  "name": "DDParser",
  "full_name": "Baidu Dependency Parser",
  "description": "**DDParser**, or **Baidu Dependency Parser**, is a Chinese dependency parser trained on a large-scale manually labeled dataset called Baidu Chinese Treebank (DuCTB).\r\n\r\nFor inputs, for the $i$ th word, its input vector $e_{i}$ is the concatenation of the word embedding and character-level representation:\r\n\r\n$$\r\ne\\_{i}=e\\_{i}^{w o r d} \\oplus C h a r L S T M\\left(w\\_{i}\\right)\r\n$$\r\n\r\nWhere $\\operatorname{CharLSTM}\\left(w_{i}\\right)$ is the output vectors after feeding the character sequence into a [BiLSTM](https://paperswithcode.com/method/bilstm) layer. The experimental results on DuCTB dataset show that replacing POS tag embeddings with $\\operatorname{CharLSTM}\\left(w_{i}\\right)$ leads to the improvement.\r\n\r\nFor the BiLSTM encoder, three BiLSTM layers are employed over the input vectors for context encoding. Denote $r\\_{i}$ the output vector of the top-layer BiLSTM for $w\\_{i}$\r\n\r\nThe dependency parser of [Dozat and Manning](https://arxiv.org/abs/1611.01734) is used. Dimension-reducing MLPs are applied to each recurrent output vector $r\\_{i}$ before applying the biaffine transformation. Applying smaller MLPs to the recurrent output states before the biaffine classifier has the advantage of stripping away information not relevant to the current decision. Then biaffine attention is used both in the dependency arc classifier and relation classifier. 
The computations of all symbols in the Figure are shown below:\r\n\r\n$$\r\nh_{i}^{\\text{d-arc}}=\\operatorname{MLP}^{\\text{d-arc}}\\left(r_{i}\\right)\r\n$$\r\n$$\r\nh_{i}^{\\text{h-arc}}=\\operatorname{MLP}^{\\text{h-arc}}\\left(r_{i}\\right)\r\n$$\r\n$$\r\nh_{i}^{\\text{d-rel}}=\\operatorname{MLP}^{\\text{d-rel}}\\left(r_{i}\\right)\r\n$$\r\n$$\r\nh_{i}^{\\text{h-rel}}=\\operatorname{MLP}^{\\text{h-rel}}\\left(r_{i}\\right)\r\n$$\r\n$$\r\nS^{\\text{arc}}=\\left(H^{\\text{d-arc}} \\oplus I\\right) U^{\\text{arc}} H^{\\text{h-arc}}\r\n$$\r\n$$\r\nS^{\\text{rel}}=\\left(H^{\\text{d-rel}} \\oplus I\\right) U^{\\text{rel}}\\left(\\left(H^{\\text{h-rel}}\\right)^{T} \\oplus I\\right)^{T}\r\n$$\r\n\r\nFor the decoder, the first-order Eisner algorithm is used to ensure that the output is a projective tree. Based on the dependency tree built by the biaffine parser, a word sequence is obtained through an in-order traversal of the tree. The output is a projective tree only if the word sequence is in order.",
  "title": "A Practical Chinese Dependency Parser Based on A Large-scale Dataset",
  "collection": "Dependency Parsers",
  "area": "Natural Language Processing"
}
{
  "name": "Mobile DenseNet",
  "full_name": "Mobile DenseNet",
  "description": "",
  "title": "EfficientPose: Scalable single-person pose estimation",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "GA-PID/NN-PID",
  "full_name": "GA-PID/NN-PID",
  "description": "The main control tasks in autonomous vehicles are steering (lateral) and speed (longitudinal) control. PID controllers are widely used in the industry because of their simplicity and good performance, but they are difficult to tune and need additional adaptation to control nonlinear systems with varying parameters. In this paper, the longitudinal control task is addressed by implementing adaptive PID control\r\nusing two different approaches: Genetic Algorithms (GA-PID) and then Neural Networks (NN-PID) respectively. The vehicle\r\nnonlinear longitudinal dynamics are modeled using Powertrain blockset library. Finally, simulations are performed to assess\r\nand compare the performance of the two controllers subject to external disturbances.",
  "title": null,
  "collection": "Control and Decision Systems",
  "area": "General"
}
{
  "name": "ResNet-RS",
  "full_name": "ResNet-RS",
  "description": "**ResNet-RS** is a family of [ResNet](https://paperswithcode.com/method/resnet) architectures that are 1.7x faster than [EfficientNets](https://paperswithcode.com/method/efficientnet) on TPUs, while achieving similar accuracies on ImageNet. The authors propose two new scaling strategies: (1) scale model depth in regimes where overfitting can occur (width scaling is preferable otherwise); (2) increase image resolution more slowly than previously recommended.\r\n\r\nAdditional improvements include the use of a [cosine learning rate schedule](https://paperswithcode.com/method/cosine-annealing), [label smoothing](https://paperswithcode.com/method/label-smoothing), [stochastic depth](https://paperswithcode.com/method/stochastic-depth), [RandAugment](https://paperswithcode.com/method/randaugment), decreased [weight decay](https://paperswithcode.com/method/weight-decay), [squeeze-and-excitation](https://paperswithcode.com/method/squeeze-and-excitation-block) and the use of the [ResNet-D](https://paperswithcode.com/method/resnet-d) architecture.",
  "title": "Revisiting ResNets: Improved Training and Scaling Strategies",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "AUCC",
  "full_name": "Area Under the ROC Curve for Clustering",
  "description": "The area under the receiver operating characteristics (ROC) Curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation as we consider, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to an effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful to further assess a candidate clustering solution from a broader, qualitative perspective as well.",
  "title": null,
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "DimFuse",
  "full_name": "Dimension-wise Fusion",
  "description": "**Dimension-wise Fusion** is an image model block that attempts to capture global information by combining features globally. It is an alternative to point-wise [convolution](https://paperswithcode.com/method/convolution). A point-wise convolutional layer applies $D$ point-wise kernels $\\mathbf{k}\\_p \\in \\mathbb{R}^{3D \\times 1 \\times 1}$ and performs $3D^2HW$ operations to combine dimension-wise representations of $\\mathbf{Y_{Dim}} \\in \\mathbb{R}^{3D \\times H \\times W}$ and produce an output $\\mathbf{Y} \\in \\mathbb{R}^{D \\times H \\times W}$. This is computationally expensive. Dimension-wise fusion is an alternative that can allow us to combine representations of $\\mathbf{Y\\_{Dim}}$ efficiently. As illustrated in the Figure to the right, it factorizes the point-wise convolution in two steps: (1) local fusion and (2) global fusion.",
  "title": "DiCENet: Dimension-wise Convolutions for Efficient Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "ROCKET",
  "full_name": "Random Convolutional Kernel Transform",
  "description": "Linear classifier using random convolutional kernels applied to time series.",
  "title": "ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels",
  "collection": "Time Series Analysis",
  "area": "Sequential"
}
{
  "name": "CMCL",
  "full_name": "Crossmodal Contrastive Learning",
  "description": "**CMCL**, or **Crossmodal Contrastive Learning**, is a method for unifying visual and textual representations into the same semantic space based on a large-scale corpus of image collections, text corpus and image-text pairs. The CMCL aligns the visual representations and textual representations, and unifies them into the same semantic space based on image-text pairs. As shown in the Figure, to facilitate different levels of semantic alignment between vision and language, a series of text rewriting techniques are utilized to improve the diversity of cross-modal information. Specifically, for an image-text pair, various positive examples and hard negative examples can be obtained by rewriting the original caption at different levels. Moreover, to incorporate more background information from the single-modal data, text and image retrieval are also applied to augment each image-text pair with various related texts and images. The positive pairs, negative pairs, related images and texts are learned jointly by CMCL. In this way, the model can effectively unify different levels of visual and textual representations into the same semantic space, and incorporate more single-modal knowledge to enhance each other.",
  "title": "UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Temporal Jittering",
  "full_name": "Temporal Jittering",
  "description": "**Temporal Jittering** is a method used in deep learning for video, where we sample multiple training clips from each video with random start times during at every epoch.",
  "title": null,
  "collection": "Video Sampling",
  "area": "Computer Vision"
}
{
  "name": "CentripetalNet",
  "full_name": "CentripetalNet",
  "description": "**CentripetalNet** is a keypoint-based detector which uses centripetal shift to pair corner keypoints from the same instance. CentripetalNet predicts the position and the centripetal shift of the corner points and matches corners whose shifted results are aligned.",
  "title": "CentripetalNet: Pursuing High-quality Keypoint Pairs for Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "MobileViT",
  "full_name": "MobileViT",
  "description": "MobileViT is a vision transformer that is tuned to mobile phone",
  "title": "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "(2+1)D Convolution",
  "full_name": "(2+1)D Convolution",
  "description": "A **(2+1)D Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) used for action recognition convolutional neural networks, with a spatiotemporal volume. As opposed to applying a [3D Convolution](https://paperswithcode.com/method/3d-convolution) over the entire volume, which can be computationally expensive and lead to overfitting, a (2+1)D convolution splits computation into two convolutions: a spatial 2D convolution followed by a temporal 1D convolution.",
  "title": "A Closer Look at Spatiotemporal Convolutions for Action Recognition",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Cycle Consistency Loss",
  "full_name": "Cycle Consistency Loss",
  "description": "**Cycle Consistency Loss** is a type of loss used for generative adversarial networks that performs unpaired image-to-image translation. It was introduced with the [CycleGAN](https://paperswithcode.com/method/cyclegan) architecture. For two domains $X$ and $Y$, we want to learn a mapping $G : X \\rightarrow Y$ and $F: Y \\rightarrow X$. We want to enforce the intuition that these mappings should be reverses of each other and that both mappings should be bijections. Cycle Consistency Loss encourages $F\\left(G\\left(x\\right)\\right) \\approx x$ and $G\\left(F\\left(y\\right)\\right) \\approx y$.  It reduces the space of possible mapping functions by enforcing forward and backwards consistency:\r\n\r\n$$ \\mathcal{L}\\_{cyc}\\left(G, F\\right) = \\mathbb{E}\\_{x \\sim p\\_{data}\\left(x\\right)}\\left[||F\\left(G\\left(x\\right)\\right) - x||\\_{1}\\right] + \\mathbb{E}\\_{y \\sim p\\_{data}\\left(y\\right)}\\left[||G\\left(F\\left(y\\right)\\right) - y||\\_{1}\\right] $$",
  "title": "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "SVPG",
  "full_name": "Stein Variational Policy Gradient",
  "description": "**Stein Variational Policy Gradient**, or **SVPG**, is a policy gradient based method in reinforcement learning that uses Stein Variational Gradient Descent to allow simultaneous exploitation and exploration of multiple policies. Unlike traditional policy optimization which attempts to learn a single policy, SVPG models a distribution of policy parameters, where samples from this distribution will represent strong policies.  SVPG optimizes this distribution of policy parameters with (relative) [entropy regularization](https://paperswithcode.com/method/entropy-regularization). The (relative) entropy term explicitly encourages exploration in the parameter space while also optimizing the expected utility of polices drawn from this distribution. Stein variational gradient descent (SVGD) is then used to optimize this distribution. SVGD leverages efficient deterministic dynamics to transport a set of particles to approximate given target posterior distributions. \r\n\r\nThe update takes the form:\r\n\r\n$$ $$\r\n\r\n$$ \\nabla\\theta\\_i = \\frac{1} {n}\\sum\\_{j=1}^n \\nabla\\_{\\theta\\_{j}} \\left(\\frac{1}{\\alpha} J(\\theta\\_{j}) + \\log q\\_0(\\theta\\_j)\\right)k(\\theta\\_j, \\theta\\_i) + \\nabla\\_{\\theta\\_j} k(\\theta\\_j, \\theta\\_i)$$\r\n\r\nNote that here the magnitude of $\\alpha$ adjusts the relative importance between the policy gradient and the prior term $\\nabla_{\\theta_j} \\left(\\frac{1}{\\alpha} J(\\theta_j) + \\log q_0(\\theta_j)\\right)k(\\theta_j, \\theta_i)$ and the repulsive term $\\nabla_{\\theta_j} k(\\theta_j, \\theta_i)$. The repulsive functional is used to diversify particles to enable parameter exploration. A suitable $\\alpha$ provides a good trade-off between exploitation and exploration. If $\\alpha$ is too large, the Stein gradient would only drive the particles to be consistent with the prior $q_0$. 
As $\\alpha \\to 0$, this algorithm is reduced to running $n$ copies of independent policy gradient algorithms, if $\\{\\theta_i\\}$ are initialized very differently. A careful annealing scheme of $\\alpha$ allows efficient exploration in the beginning of training and later focuses on exploitation towards the end of training.",
  "title": "Stein Variational Policy Gradient",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "XGrad-CAM",
  "full_name": "XGrad-CAM",
  "description": "**XGrad-CAM**, or **Axiom-based Grad-CAM**, is a class-discriminative visualization method and able to highlight the regions belonging to the objects of interest. Two axiomatic properties are introduced in the derivation of XGrad-CAM: Sensitivity and Conservation. In particular, the proposed XGrad-CAM is still a linear combination of feature maps, but able to meet the constraints of those two axioms.",
  "title": "Axiom-based Grad-CAM: Towards Accurate Visualization and Explanation of CNNs",
  "collection": "Explainable CNNs",
  "area": "Computer Vision"
}
{
  "name": "EfficientNetV2",
  "full_name": "EfficientNetV2",
  "description": "**EfficientNetV2** is a type convolutional neural network that has faster training speed and better parameter efficiency than [previous models](https://paperswithcode.com/method/efficientnet). To develop these models, the authors use a combination of training-aware [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) and scaling, to jointly optimize training speed. The models were searched from the search space enriched with new ops such as [Fused-MBConv](https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html).\r\n\r\nArchitecturally the main differences are:\r\n\r\n- EfficientNetV2 extensively uses both [MBConv](https://paperswithcode.com/method/inverted-residual-block)  and the newly added fused-MBConv in the early layers.\r\n- EfficientNetV2 prefers smaller expansion ratio for [MBConv](https://paperswithcode.com/method/inverted-residual-block) since smaller expansion ratios tend to have less memory access overhead.\r\n- EfficientNetV2 prefers smaller 3x3 kernel sizes, but it adds more layers to compensate the reduced receptive field resulted from the smaller kernel size. \r\n- EfficientNetV2 completely removes the last stride-1 stage in the original EfficientNet, wperhaps due to its large parameter size and memory access overhead.",
  "title": "EfficientNetV2: Smaller Models and Faster Training",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Spatial Attention-Guided Mask",
  "full_name": "Spatial Attention-Guided Mask",
  "description": "**A Spatial Attention-Guided Mask** is a module for [instance segmentation](https://paperswithcode.com/task/instance-segmentation) that predicts a segmentation mask on each detected box with a spatial attention map that helps to focus on informative pixels and suppress noise. The goal is to guide the mask head for spotlighting meaningful pixels and repressing uninformative ones. \r\n\r\nOnce features inside the predicted RoIs are extracted by [RoIAlign](https://paperswithcode.com/method/roi-align) with 14×14 resolution, those features are fed into four conv layers and the [spatial attention module](https://paperswithcode.com/method/spatial-attention-module) (SAM) sequentially. To exploit the spatial attention map $A\\_{sag}\\left(X\\_{i}\\right) \\in \\mathcal{R}^{1\\times{W}\\times{H}}$ as a feature descriptor given input feature map $X\\_{i} \\in \\mathcal{R}^{C×W×H}$, the SAM first generates pooled features $P\\_{avg}, P\\_{max} \\in \\mathcal{R}^{1\\times{W}\\times{H}}$ by both average and [max pooling](https://paperswithcode.com/method/max-pooling) operations respectively along the channel axis and aggregates them via concatenation. Then it is followed by a 3 × 3 conv layer and normalized by the sigmoid function. The computation process\r\nis summarized as follow:\r\n\r\n$$\r\nA\\_{sag}\\left(X\\_{i}\\right) = \\sigma\\left(F\\_{3\\times{3}}(P\\_{max} \\cdot P\\_{avg})\\right)\r\n$$\r\n\r\nwhere $\\sigma$ denotes the sigmoid function, $F\\_{3\\times{3}}$ is 3 × 3 conv layer and $\\cdot$ represents the concatenate operation. Finally, the attention guided feature map $X\\_{sag} ∈ \\mathcal{R}^{C\\times{W}\\times{H}}$ is computed as:\r\n\r\n$$\r\nX\\_{sag} = A\\_{sag}\\left(X\\_{i}\\right) \\otimes X\\_{i}\r\n$$\r\n\r\nwhere ⊗ denotes element-wise multiplication. After then, a 2 × 2 deconv upsamples the spatially attended feature map to 28 × 28 resolution. Lastly, a 1 × 1 conv is applied for predicting class-specific masks.",
  "title": "CenterMask : Real-Time Anchor-Free Instance Segmentation",
  "collection": "Mask Branches",
  "area": "Computer Vision"
}
{
  "name": "Source Hypothesis Transfer",
  "full_name": "Source Hypothesis Transfer",
  "description": "**Source Hypothesis Transfer**, or **SHOT**, is a representation learning framework for unsupervised domain adaptation. SHOT freezes the classifier module (hypothesis) of the source model and learns the target-specific feature extraction module by exploiting both information maximization and self-supervised pseudo-labeling to implicitly align representations from the target domains to the source hypothesis.",
  "title": "Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "GAP-Layer",
  "full_name": "Spectral Gap Rewiring Layer",
  "description": "**TL;DR: GAP-Layer is a GNN Layer which is able to rewire a graph in an inductive an parameter-free way optimizing the spectral gap (minimizing or maximizing the bottleneck size), learning a differentiable way to compute the Fiedler vector and the Fiedler value of the graph.**\r\n\r\n## Summary\r\n **GAP-Layer** is a rewiring layer based on minimizing or maximizing the spectral gap (or graph bottleneck size) in an inductive way. Depending on the mining task we want to perform in our graph, we would like to maximize or minimize the size of the bottleneck, aiming to more connected or more separated communities. \r\n\r\n## GAP-Layer: Spectral Gap Rewiring\r\n\r\n#### Loss and derivatives using $\\mathbf{L}$ or $\\mathbf{\\cal L}$\r\nFor this explanation, we are going to suppose we want to minimize the spectral gap, i.e. make the graph bottleneck size smaller. For minimizing the spectral GAP we minimize this loss:\r\n\r\n$$\r\nL\\_{Fiedler} = \\|\\tilde{\\mathbf{A}}-\\mathbf{A}\\| \\_F + \\alpha(\\lambda\\_2)^2\r\n$$\r\n\r\nThe gradients of this cost function w.r.t each element of $\\mathbf{A}$ are not trivial. Depending on if we use the Laplacian, $\\mathbf{L}$, or the normalized Laplacian, $\\cal L$, the derivatives are going to be different. For the former case ($\\mathbf{L}$), we will use the derivatives presented in Kang et al. 2019. In the latter scenario ($\\cal L$), we present the **Spectral Gradients**: derivatives from the spectral gap w.r.t. the Normalized Laplacian. 
However, whatever option we choose, $\\lambda_2$ can be seen as a function of $\\tilde{\\mathbf{A}}$ and, hence, $\\nabla\\_{\\tilde{\\mathbf{A}}}\\lambda\\_2$, the gradient of $\\lambda\\_2$ w.r.t. each component of $\\tilde{\\mathbf{A}}$ (*how does the bottleneck change with each change in our graph?*), comes from the chain rule of the matrix derivative $Tr\\left[\\left(\\nabla\\_{\\tilde{\\mathbf{L}}}\\lambda\\_2\\right)^T\\cdot\\nabla\\_{\\tilde{\\mathbf{A}}}\\tilde{\\mathbf{L}}\\right]$ if using the Laplacian or $Tr\\left[\\left(\\nabla\\_{\\tilde{\\mathbf{\\cal L}}}\\lambda\\_2\\right)^T\\cdot\\nabla\\_{\\tilde{\\mathbf{A}}}\\tilde{\\mathbf{\\cal L}}\\right]$ if using the normalized Laplacian. Both of these derivatives rely on the Fiedler vector (2nd eigenvector: $\\mathbf{f}\\_2$ if we use $\\mathbf{L}$ and $\\mathbf{g}\\_2$ if using $\\mathbf{\\cal L}$ instead). The details of those derivatives are omitted here for simplicity; see the original paper.\r\n\r\n#### Differentiable approximation of $\\mathbf{f}_2$ and $\\lambda_2$\r\nOnce we have those derivatives, the problem is still not trivial. Note that our cost function $L\\_{Fiedler}$ relies on an eigenvalue $\\lambda\\_2$. In addition, the derivatives also depend on the Fiedler vector $\\mathbf{f}\\_2$ or $\\mathbf{g}\\_2$, which is the eigenvector corresponding to the aforementioned eigenvalue. However, we **DO NOT COMPUTE IT SPECTRALLY**, as its computation has a complexity of $O(n^3)$ and would need to be computed in every learning iteration. Instead, **we learn an approximation of $\\mathbf{f}\\_2$ and use its Dirichlet energy ${\\cal E}(\\mathbf{f}\\_2)$ to approximate $\\lambda_2$**. 
\r\n$$\r\n\\mathbf{f}\\_2(u) = \\left\\{ \\begin{array}{cl}\r\n       +1/\\sqrt{n}  & \\text{if}\\;\\; u\\;\\; \\text{belongs to the first cluster} \\\\\r\n       -1/\\sqrt{n}  & \\text{if}\\;\\; u\\;\\; \\text{belongs to the second cluster} \r\n\\end{array} \\right.\r\n$$\r\nIn addition, if using $\\mathbf{\\cal L}$, since $\\mathbf{g}\\_2=\\mathbf{D}^{1/2}\\mathbf{f}_2$, we first approximate $\\mathbf{g}_2$ and then approximate $\\lambda_2$ from ${\\cal E}(\\mathbf{g}\\_2)$. With this approximation, we can easily compute which cluster each node belongs to with a simple MLP. In addition, since the Fiedler vector must satisfy orthogonality and normality, constraints must be added to that MLP clustering.\r\n\r\n### GAP-Layer\r\nTo sum up, **GAP-Layer** can be defined as follows. Given the matrix $\\mathbf{X}\\_{n\\times F}$ encoding the features of the nodes after any message passing (MP) layer, $\\mathbf{S}\\_{n\\times 2}=\\textrm{Softmax}(\\textrm{MLP}(\\mathbf{X}))$ learns the association $\\mathbf{X}\\rightarrow \\mathbf{S}$ while $\\mathbf{S}$ is optimized according to the loss:\r\n  \r\n$$\r\nL\\_{Cut} = -\\frac{Tr[\\mathbf{S}^T\\mathbf{A}\\mathbf{S}]}{Tr[\\mathbf{S}^T\\mathbf{D}\\mathbf{S}]} + \\left\\|\\frac{\\mathbf{S}^T\\mathbf{S}}{\\|\\mathbf{S}^T\\mathbf{S}\\|\\_F} - \\frac{\\mathbf{I}\\_n}{\\sqrt{2}}\\right\\|\\_F\r\n$$\r\nThen, $\\mathbf{f}\\_2$ is approximated from $\\mathbf{S}$ using the $\\mathbf{f}\\_2(u)$ equation. Once $\\mathbf{f}\\_2$ and $\\lambda\\_2$ have been calculated, we consider the loss:\r\n\r\n$$\r\nL\\_{Fiedler} = \\|\\tilde{\\mathbf{A}}-\\mathbf{A}\\|\\_F + \\alpha(\\lambda\\_2)^2\r\n$$\r\n$$\\mathbf{\\tilde{A}} = \\mathbf{A} - \\mu \\nabla_\\mathbf{\\tilde{A}}\\lambda\\_2$$\r\nreturning $\\tilde{\\mathbf{A}}$. Then the GAP diffusion $\\mathbf{T}^{GAP} = \\tilde{\\mathbf{A}}(\\mathbf{S}) \\odot \\mathbf{A}$ results from minimizing \r\n\r\n$$L_{GAP}= L\\_{Cut} + L\\_{Fiedler}$$\r\n\r\n\r\n**References**\r\n(Kang et al. 2019) Kang, J., & Tong, H. 
(2019, November). N2n: Network derivative mining. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (pp. 861-870).",
  "title": "DiffWire: Inductive Graph Rewiring via the Lovász Bound",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Active Convolution",
  "full_name": "Active Convolution",
  "description": "An **Active Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) which does not have a fixed shape of the receptive field, and can be used to take more diverse forms of receptive fields for convolutions. Its shape can be learned through backpropagation during training. It can be seen as a generalization of convolution; it can define not only all conventional convolutions, but also convolutions with fractional pixel coordinates. We can freely change the shape of the convolution, which provides greater freedom to form CNN structures. Second, the shape of the convolution is learned while training and there is no need to tune it by hand",
  "title": "Active Convolution: Learning the Shape of Convolution for Image Classification",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Mechanism Transfer",
  "full_name": "Mechanism Transfer",
  "description": "**Mechanism Transfer** is a meta-distributional scenario for few-shot domain adaptation in which a data generating mechanism is invariant across domains. This transfer assumption can accommodate nonparametric shifts resulting in apparently different distributions while providing a solid statistical basis for domain adaptation.",
  "title": "Few-shot Domain Adaptation by Causal Mechanism Transfer",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "FT-Transformer",
  "full_name": "FT-Transformer",
  "description": "FT-Transformer (Feature Tokenizer + Transformer) is a simple adaptation of the [Transformer](/method/transformer) architecture for the tabular domain. The model (Feature Tokenizer component) transforms all features (categorical and numerical) to tokens and runs a stack of Transformer layers over the tokens, so every Transformer layer operates on the feature level of one object. (This model is similar to [AutoInt](/method/autoint)). In the Transformer component, the `[CLS]` token is appended to $T$. Then $L$ Transformer layers are applied. PreNorm is used for easier optimization and good performance. The final representation of the `[CLS]` token is used for prediction.",
  "title": "Revisiting Deep Learning Models for Tabular Data",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "DPG",
  "full_name": "Deterministic Policy Gradient",
  "description": "**Deterministic Policy Gradient**, or **DPG**, is a policy gradient method for reinforcement learning. Instead of the policy function $\\pi\\left(.\\mid{s}\\right)$ being modeled as a probability distribution, DPG considers and calculates gradients for a deterministic policy $a = \\mu\\_{theta}\\left(s\\right)$.",
  "title": null,
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "DMA",
  "full_name": "Dual Multimodal Attention",
  "description": "In image inpainting task, the mechanism extracts complementary features from the word embedding in two paths by reciprocal attention, which is done by comparing the descriptive text and complementary image areas through reciprocal attention.",
  "title": "Text-Guided Neural Image Inpainting",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ConvLSTM",
  "full_name": "ConvLSTM",
  "description": "**ConvLSTM** is a type of recurrent neural network for spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions. The ConvLSTM determines the future state of a certain cell in the grid by the inputs and past states of its local neighbors. This can easily be achieved by using a [convolution](https://paperswithcode.com/method/convolution) operator in the state-to-state and input-to-state transitions (see Figure). The key equations of ConvLSTM are shown  below, where $∗$ denotes the convolution operator and $\\odot$ the Hadamard product:\r\n\r\n$$ i\\_{t} = \\sigma\\left(W\\_{xi} ∗ X\\_{t} + W\\_{hi} ∗ H\\_{t−1} + W\\_{ci} \\odot \\mathcal{C}\\_{t−1} + b\\_{i}\\right) $$\r\n\r\n$$ f\\_{t} = \\sigma\\left(W\\_{xf} ∗ X\\_{t} + W\\_{hf} ∗ H\\_{t−1} + W\\_{cf} \\odot \\mathcal{C}\\_{t−1} + b\\_{f}\\right) $$\r\n\r\n$$ \\mathcal{C}\\_{t} = f\\_{t} \\odot \\mathcal{C}\\_{t−1} + i\\_{t} \\odot \\text{tanh}\\left(W\\_{xc} ∗ X\\_{t} + W\\_{hc} ∗ \\mathcal{H}\\_{t−1} + b\\_{c}\\right) $$\r\n\r\n$$ o\\_{t} = \\sigma\\left(W\\_{xo} ∗ X\\_{t} + W\\_{ho} ∗ \\mathcal{H}\\_{t−1} + W\\_{co} \\odot \\mathcal{C}\\_{t} + b\\_{o}\\right) $$\r\n\r\n$$ \\mathcal{H}\\_{t} = o\\_{t} \\odot \\text{tanh}\\left(C\\_{t}\\right) $$\r\n\r\nIf we view the states as the hidden representations of moving objects, a ConvLSTM with a larger transitional kernel should be able to capture faster motions while one with a smaller kernel can capture slower motions. \r\n\r\nTo ensure that the states have the same number of rows and same number of columns as the inputs, padding is needed before applying the convolution operation. Here, padding of the hidden states on the boundary points can be viewed as using the state of the outside world for calculation. 
Usually, before the first input comes, we initialize all the states of the [LSTM](https://paperswithcode.com/method/lstm) to zero which corresponds to \"total ignorance\" of the future.",
  "title": "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Layer Normalization",
  "full_name": "Layer Normalization",
  "description": "Unlike [batch normalization](https://paperswithcode.com/method/batch-normalization), **Layer Normalization** directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases. It works well for [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks) and improves both the training time and the generalization performance of several existing RNN models. More recently, it has been used with [Transformer](https://paperswithcode.com/methods/category/transformers) models.\r\n\r\nWe compute the layer normalization statistics over all the hidden units in the same layer as follows:\r\n\r\n$$ \\mu^{l} = \\frac{1}{H}\\sum^{H}\\_{i=1}a\\_{i}^{l} $$\r\n\r\n$$ \\sigma^{l} = \\sqrt{\\frac{1}{H}\\sum^{H}\\_{i=1}\\left(a\\_{i}^{l}-\\mu^{l}\\right)^{2}}  $$\r\n\r\nwhere $H$ denotes the number of hidden units in a layer. Under layer normalization, all the hidden units in a layer share the same normalization terms $\\mu$ and $\\sigma$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of the mini-batch and it can be used in the pure online regime with batch size 1.",
  "title": "Layer Normalization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Deep Sets",
  "full_name": "Deep Sets",
  "description": "",
  "title": "Deep Sets",
  "collection": "Backbone Architectures",
  "area": "Computer Vision"
}
{
  "name": "PP-YOLO",
  "full_name": "PP-YOLO",
  "description": "**PP-YOLO** is an object detector based on [YOLOv3](https://paperswithcode.com/method/yolov3). It mainly tries to combine various existing tricks that almost not increase the number of model parameters and FLOPs, to achieve the goal of improving the accuracy of detector as much as possible while ensuring that the speed is almost unchanged. Some of these changes include:\r\n\r\n- Changing the [DarkNet-53](https://paperswithcode.com/method/darknet-53) backbone with ResNet50-vd. Some of the convolutional layers in ResNet50-vd are also replaced with [deformable convolutional layers](https://paperswithcode.com/method/deformable-convolution).\r\n- A larger batch size is used - changing from 64 to 192.\r\n- An exponentially moving average is used for the parameters.\r\n- [DropBlock](https://paperswithcode.com/method/dropblock) is applied to the [FPN](https://paperswithcode.com/method/fpn).\r\n- An IoU loss is used.\r\n- An IoU prediction branch is added to measure the accuracy of localization.\r\n- [Grid Sensitive](https://paperswithcode.com/method/grid-sensitive) is used, similar to [YOLOv4](https://paperswithcode.com/method/yolov4).\r\n- [Matrix NMS](https://paperswithcode.com/method/matrix-nms) is used.\r\n- [CoordConv](https://paperswithcode.com/method/coordconv) is used for the [FPN](https://paperswithcode.com/method/fpn), replacing the 1x1 convolution layer, and also the first convolution layer in the detection head.\r\n- [Spatial Pyramid Pooling](https://paperswithcode.com/method/spatial-pyramid-pooling) is used for the top feature map.",
  "title": "PP-YOLO: An Effective and Efficient Implementation of Object Detector",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "DCN-V2",
  "full_name": "DCN-V2",
  "description": "**DCN-V2** is an architecture for learning-to-rank that improves upon the original [DCN](http://paperswithcode.com/method/dcn) model. It first learns explicit feature interactions of the inputs (typically the embedding layer) through cross layers, and then combines with a deep network to learn complementary implicit interactions. The core of DCN-V2 is the cross layers, which inherit the simple structure of the cross network from DCN, however it is significantly more expressive at learning explicit and bounded-degree cross features.",
  "title": "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems",
  "collection": "Learning to Rank Models",
  "area": "General"
}
{
  "name": "Multi-Head Attention",
  "full_name": "Multi-Head Attention",
  "description": "**Multi-head Attention** is a module for attention mechanisms which runs through an attention mechanism several times in parallel. The independent attention outputs are then concatenated and linearly transformed into the expected dimension. Intuitively, multiple attention heads allows for attending to parts of the sequence differently (e.g. longer-term dependencies versus shorter-term dependencies). \r\n\r\n$$ \\text{MultiHead}\\left(\\textbf{Q}, \\textbf{K}, \\textbf{V}\\right) = \\left[\\text{head}\\_{1},\\dots,\\text{head}\\_{h}\\right]\\textbf{W}_{0}$$\r\n\r\n$$\\text{where} \\text{ head}\\_{i} = \\text{Attention} \\left(\\textbf{Q}\\textbf{W}\\_{i}^{Q}, \\textbf{K}\\textbf{W}\\_{i}^{K}, \\textbf{V}\\textbf{W}\\_{i}^{V} \\right) $$\r\n\r\nAbove $\\textbf{W}$ are all learnable parameter matrices.\r\n\r\nNote that [scaled dot-product attention](https://paperswithcode.com/method/scaled) is most commonly used in this module, although in principle it can be swapped out for other types of attention mechanism.\r\n\r\nSource: [Lilian Weng](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#a-family-of-attention-mechanisms)",
  "title": "Attention Is All You Need",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "RepPoints",
  "full_name": "RepPoints",
  "description": "**RepPoints** is a representation for object detection that consists of a set of points which indicate the spatial extent of an object and semantically significant local areas. This representation is learned via weak localization supervision from rectangular ground-truth boxes and implicit recognition feedback. Based on the richer RepPoints representation, the authors develop an anchor-free object detector that yields improved performance compared to using bounding boxes.",
  "title": "RepPoints: Point Set Representation for Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "BTF",
  "full_name": "Back to the Feature",
  "description": "",
  "title": "Back to the Feature: Classical 3D Features are (Almost) All You Need for 3D Anomaly Detection",
  "collection": "Point Cloud Representations",
  "area": "Computer Vision"
}
{
  "name": "Generalized Mean Pooling",
  "full_name": "Generalized Mean Pooling",
  "description": "**Generalized Mean Pooling (GeM)** computes the generalized mean of each channel in a tensor. Formally:\r\n\r\n$$ \\textbf{e} = \\left[\\left(\\frac{1}{|\\Omega|}\\sum\\_{u\\in{\\Omega}}x^{p}\\_{cu}\\right)^{\\frac{1}{p}}\\right]\\_{c=1,\\cdots,C} $$\r\n\r\nwhere $p > 0$ is a parameter. Setting this exponent as $p > 1$ increases the contrast of the pooled feature map and focuses on the salient features of the image. GeM is a generalization of the [average pooling](https://paperswithcode.com/method/average-pooling) commonly used in classification networks ($p = 1$) and of spatial max-pooling layer ($p = \\infty$).\r\n\r\nSource: [MultiGrain](https://paperswithcode.com/method/multigrain)\r\n\r\nImage Source: [Eva Mohedano](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.slideshare.net%2Fxavigiro%2Fd1l5-contentbased-image-retrieval-upc-2018-deep-learning-for-computer-vision&psig=AOvVaw2-9Hx23FNGFDe4GHU22Oo5&ust=1591798200590000&source=images&cd=vfe&ved=0CA0QjhxqFwoTCOiP-9P09OkCFQAAAAAdAAAAABAD)",
  "title": null,
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "Location Sensitive Attention",
  "full_name": "Location Sensitive Attention",
  "description": "**Location Sensitive Attention** is an attention mechanism that extends the [additive attention mechanism](https://paperswithcode.com/method/additive-attention) to use cumulative attention weights from previous decoder time steps as an additional feature. This encourages the model to move forward consistently through the input, mitigating potential failure modes where some subsequences are repeated or ignored by the decoder.\r\n\r\nStarting with additive attention where $h$ is a sequential representation from a BiRNN encoder and ${s}\\_{i-1}$ is the $(i − 1)$-th state of a recurrent neural network (e.g. a [LSTM](https://paperswithcode.com/method/lstm) or [GRU](https://paperswithcode.com/method/gru)):\r\n\r\n$$ e\\_{i, j} = w^{T}\\tanh\\left(W{s}\\_{i-1} + Vh\\_{j} + b\\right) $$\r\n\r\nwhere $w$ and $b$ are vectors, $W$ and $V$ are matrices. We extend this to be location-aware by making it take into account the alignment produced at the previous step. First, we extract $k$ vectors\r\n$f\\_{i,j} \\in \\mathbb{R}^{k}$ for every position $j$ of the previous alignment $\\alpha\\_{i−1}$ by convolving it with a matrix $F \\in R^{k\\times{r}}$:\r\n\r\n$$ f\\_{i} = F ∗ \\alpha\\_{i−1} $$\r\n\r\nThese additional vectors $f\\_{i,j}$ are then used by the scoring mechanism $e\\_{i,j}$:\r\n\r\n$$ e\\_{i,j} = w^{T}\\tanh\\left(Ws\\_{i−1} + Vh\\_{j} + Uf\\_{i,j} + b\\right) $$",
  "title": "Attention-Based Models for Speech Recognition",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "myGym",
  "full_name": "MyGym: Modular Toolkit for Visuomotor Robotic Tasks",
  "description": "We introduce myGym, a toolkit suitable for fast prototyping of neural networks in the area of robotic manipulation and navigation. Our toolbox is fully modular, enabling users to train their algorithms on different robots, environments, and tasks. We also include pretrained neural network modules for the real-time vision that allows training visuomotor tasks with sim2real transfer. The visual modules can be easily retrained using the dataset generation pipeline with domain augmentation and randomization. Moreover, myGym provides automatic evaluation methods and baselines that help the user to directly compare their trained model with the state-of-the-art algorithms. We additionally present a novel metric, called learnability, to compare the general learning capability of algorithms in different settings, where the complexity of the environment, robot, and the task is systematically manipulated. The learnability score tracks differences between the performance of algorithms in increasingly challenging setup conditions, and thus allows the user to compare different models in a more systematic fashion. The code is accessible at https://github.com/incognite-lab/myGym",
  "title": null,
  "collection": "Robotic Manipulation Models",
  "area": "General"
}
{
  "name": "Filter Response Normalization",
  "full_name": "Filter Response Normalization",
  "description": "**Filter Response Normalization (FRN)** is a type of normalization that combines normalization and an activation function, which can be used as a replacement for other normalizations and activations. It operates on each activation channel of each batch element independently, eliminating the dependency on other batch elements. \r\n\r\nTo demonstrate, assume we are dealing with the feed-forward convolutional neural network. We follow the usual convention that the filter responses (activation maps) produced after a [convolution](https://paperswithcode.com/method/convolution) operation are a [4D ](https://paperswithcode.com/method/4d-a)tensor $X$ with shape $[B, W, H, C]$, where $B$ is the mini-batch size, $W, H$ are the spatial extents of the map, and $C$ is the number of filters used in convolution. $C$ is also referred to as output channels. Let $x = X_{b,:,:,c} \\in \\mathcal{R}^{N}$, where $N = W \\times H$, be the vector of filter responses for the $c^{th}$ filter for the $b^{th}$ batch point. \r\nLet $\\nu^2 = \\sum\\_i x_i^2/N$, be the mean squared norm of $x$. \r\n\r\nThen Filter Response Normalization is defined as the following:\r\n\r\n$$\r\n\\hat{x} = \\frac{x}{\\sqrt{\\nu^2 + \\epsilon}},\r\n$$\r\n\r\nwhere $\\epsilon$ is a small positive constant to prevent division by zero.  \r\n\r\nA lack of mean centering in FRN can lead to activations having an arbitrary bias away from zero. Such a bias in conjunction with [ReLU](https://paperswithcode.com/method/relu) can have a detrimental effect on learning and lead to poor performance and dead units. To address this the authors augment ReLU with a learned threshold $\\tau$ to yield:\r\n\r\n$$\r\nz = \\max(y, \\tau)\r\n$$\r\n\r\nSince $\\max(y, \\tau){=}\\max(y-\\tau,0){+}\\tau{=}\\text{ReLU}{(y{-}\\tau)}{+}\\tau$, the effect of this activation is the same as having a shared bias before and after ReLU.",
  "title": "Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "ORB-SLAM2",
  "full_name": "ORB-Simultaneous localization and mapping",
  "description": "ORB-SLAM2 is a complete SLAM system for monocular, stereo and RGB-D cameras, including map reuse, loop closing and relocalization capabilities. The system works in real-time on standard CPUs in a wide variety of environments from small hand-held indoors sequences, to drones flying in industrial environments and cars driving around a city.\r\n\r\nSource: [Mur-Artal and Tardos](https://arxiv.org/pdf/1610.06475v2.pdf)\r\n\r\nImage source: [Mur-Artal and Tardos](https://arxiv.org/pdf/1610.06475v2.pdf)",
  "title": "ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras",
  "collection": "Localization Models",
  "area": "Computer Vision"
}
{
  "name": "TabTransformer",
  "full_name": "TabTransformer",
  "description": "**TabTransformer** is a deep tabular data modeling architecture for supervised and semi-supervised learning. The TabTransformer is built upon self-attention based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. \r\n\r\nAs an overview, the architecture comprises a column embedding layer, a stack of $N$ [Transformer](/method/transformer) layers, and a multi-layer perceptron (MLP). The contextual embeddings (outputted by the Transformer layer) are concatenated along with continuous features which is inputted to an MLP. The loss function is then minimized  to learn all the parameters in an end-to-end learning.",
  "title": "TabTransformer: Tabular Data Modeling Using Contextual Embeddings",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "DEXTR",
  "full_name": "Deep Extreme Cut",
  "description": "**DEXTR**, or **Deep Extreme Cut**, obtains an object segmentation from its four extreme points: the left-most, right-most, top, and bottom pixels. The annotated extreme points are given as a guiding signal to the input of the network. To this end, we create a [heatmap](https://paperswithcode.com/method/heatmap) with activations in the regions of extreme points. We center a 2D Gaussian around each of the points, in order to create a single heatmap. The heatmap is concatenated with the RGB channels of the input image, to form a 4-channel input for the CNN. In order to focus on the object of interest, the input is cropped by the bounding box, formed from the extreme point annotations. To include context on the resulting\r\ncrop, we relax the tight bounding box by several pixels. After the pre-processing step that comes exclusively from the extreme clicks, the input consists of an RGB crop including an object, plus its extreme points. \r\n\r\n[ResNet](https://paperswithcode.com/method/resnet)-101 is chosen as backbone of the architecture. We remove the fully connected layers as well as the [max pooling](https://paperswithcode.com/method/max-pooling) layers in the last two stages to preserve acceptable output resolution for dense prediction, and we introduce atrous convolutions in the last two stages to maintain the same receptive field. After the last ResNet-101 stage, we introduce a pyramid scene parsing module to aggregate global context to the final feature map. The output of the CNN is a probability map representing whether a pixel belongs to the object that we want to segment or not. The CNN is trained to minimize the standard cross entropy loss, which takes into account that different classes occur with different frequency in a dataset.",
  "title": "Deep Extreme Cut: From Extreme Points to Object Segmentation",
  "collection": "Image Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "3D ResNet-RS",
  "full_name": "3D ResNet-RS",
  "description": "**3D ResNet-RS** is an architecture and scaling strategy for 3D ResNets for video recognition. The key additions are:\r\n\r\n- **3D ResNet-D stem**: The [ResNet-D](https://paperswithcode.com/method/resnet-d) stem is adapted to 3D inputs by using three consecutive [3D convolutional layers](https://paperswithcode.com/method/3d-convolution). The first convolutional layer employs a temporal kernel size of 5 while the remaining two convolutional layers employ a temporal kernel size of 1.\r\n\r\n- **3D Squeeze-and-Excitation**:  [Squeeze-and-Excite](https://paperswithcode.com/method/squeeze-and-excitation-block) is adapted to spatio-temporal inputs by using a 3D [global average pooling](https://paperswithcode.com/method/global-average-pooling) operation for the squeeze operation. A SE ratio of 0.25 is applied in each 3D bottleneck block for all experiments.\r\n\r\n- **Self-gating**: A self-gating module is used in each 3D bottleneck block after the SE module.",
  "title": "Revisiting 3D ResNets for Video Recognition",
  "collection": "Video Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "LayerDrop",
  "full_name": "LayerDrop",
  "description": "**LayerDrop** is a form of structured [dropout](https://paperswithcode.com/method/dropout) for [Transformer](https://paperswithcode.com/method/transformer) models which has a regularization effect during training and allows for efficient pruning at inference time. It randomly drops layers from the Transformer according to an \"every other\" strategy where pruning with a rate $p$ means dropping the layers at depth $d$ such that $d = 0\\left\\(\\text{mod}\\left(\\text{floor}\\left(\\frac{1}{p}\\right)\\right)\\right)$.",
  "title": "Reducing Transformer Depth on Demand with Structured Dropout",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Crossbow",
  "full_name": "Crossbow",
  "description": "**Crossbow** is a single-server multi-GPU system for training deep learning models that enables users to freely choose their preferred batch size—however small—while scaling to multiple GPUs. Crossbow uses many parallel model replicas and avoids reduced statistical efficiency through a new synchronous training method. [SMA](https://paperswithcode.com/method/slime-mould-algorithm-sma), a synchronous variant of model averaging, is used in which replicas independently explore the solution space with gradient descent, but adjust their search synchronously based on the trajectory of a globally-consistent average model.",
  "title": "CROSSBOW: Scaling Deep Learning with Small Batch Sizes on Multi-GPU Servers",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "Nesterov Accelerated Gradient",
  "full_name": "Nesterov Accelerated Gradient",
  "description": "**Nesterov Accelerated Gradient** is a momentum-based [SGD](https://paperswithcode.com/method/sgd) optimizer that \"looks ahead\" to where the parameters will be to calculate the gradient **ex post** rather than **ex ante**:\r\n\r\n$$ v\\_{t} = \\gamma{v}\\_{t-1} + \\eta\\nabla\\_{\\theta}J\\left(\\theta-\\gamma{v\\_{t-1}}\\right) $$\r\n$$\\theta\\_{t} = \\theta\\_{t-1} + v\\_{t}$$\r\n\r\nLike SGD with momentum $\\gamma$ is usually set to $0.9$.\r\n\r\nThe intuition is that the [standard momentum](https://paperswithcode.com/method/sgd-with-momentum) method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient. In contrast Nesterov momentum first makes a big jump in the direction of the previous accumulated gradient and then measures the gradient where it ends up and makes a correction. The idea being that it is better to correct a mistake after you have made it. \r\n\r\nImage Source: [Geoff Hinton lecture notes](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)",
  "title": null,
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Dutch Eligibility Trace",
  "full_name": "Dutch Eligibility Trace",
  "description": "A **Dutch Eligibility Trace** is a type of [eligibility trace](https://paperswithcode.com/method/eligibility-trace) where the trace increments grow less quickly than the accumulative eligibility trace (helping avoid large variance updates). For the memory vector $\\textbf{e}\\_{t} \\in \\mathbb{R}^{b} \\geq \\textbf{0}$:\r\n\r\n$$\\mathbf{e\\_{0}} = \\textbf{0}$$\r\n\r\n$$\\textbf{e}\\_{t} = \\gamma\\lambda\\textbf{e}\\_{t-1} + \\left(1-\\alpha\\gamma\\lambda\\textbf{e}\\_{t-1}^{T}\\phi\\_{t}\\right)\\phi\\_{t}$$",
  "title": null,
  "collection": "Eligibility Traces",
  "area": "Reinforcement Learning"
}
{
  "name": "PCA Whitening",
  "full_name": "PCA Whitening",
  "description": "**PCA Whitening** is a processing step for image based data that makes input less redundant. Adjacent pixel or feature values can be highly correlated, and whitening through the use of [PCA](https://paperswithcode.com/method/pca) reduces this degree of correlation.\r\n\r\nImage Source: [Wikipedia](https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:GaussianScatterPCA.svg)",
  "title": null,
  "collection": "Whitening",
  "area": "Computer Vision"
}
{
  "name": "CV-MIM",
  "full_name": "Contrastive Cross-View Mutual Information Maximization",
  "description": "**CV-MIM**, or **Contrastive Cross-View Mutual Information Maximization**, is a representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses. The method trains a network using cross-view mutual information maximization, which maximizes mutual information of the same pose performed from different viewpoints in a contrastive learning manner. It further utilizes two regularization terms to ensure disentanglement and smoothness of the learned representations.",
  "title": "Learning View-Disentangled Human Pose Representation by Contrastive Cross-View Mutual Information Maximization",
  "collection": "Representation Learning",
  "area": "General"
}
{
  "name": "Spatially Separable Self-Attention",
  "full_name": "Spatially Separable Self-Attention",
  "description": "**Spatially Separable Self-Attention**, or **SSSA**, is an [attention module](https://paperswithcode.com/methods/category/attention-modules) used in the [Twins-SVT](https://paperswithcode.com/method/twins-svt) architecture that aims to reduce the computational complexity of [vision transformers](https://paperswithcode.com/methods/category/vision-transformer) for dense prediction tasks (given high-resolution inputs). SSSA is composed of [locally-grouped self-attention](https://paperswithcode.com/method/locally-grouped-self-attention) (LSA) and [global sub-sampled attention](https://paperswithcode.com/method/global-sub-sampled-attention) (GSA).\r\n\r\nFormally, spatially separable self-attention (SSSA) can be written as:\r\n\r\n$$\r\n\\hat{\\mathbf{z}}\\_{i j}^{l}=\\text { LSA }\\left(\\text { LayerNorm }\\left(\\mathbf{z}\\_{i j}^{l-1}\\right)\\right)+\\mathbf{z}\\_{i j}^{l-1} $$\r\n\r\n$$\\mathbf{z}\\_{i j}^{l}=\\mathrm{FFN}\\left(\\operatorname{LayerNorm}\\left(\\hat{\\mathbf{z}}\\_{i j}^{l}\\right)\\right)+\\hat{\\mathbf{z}}\\_{i j}^{l} $$\r\n\r\n$$ \\hat{\\mathbf{z}}^{l+1}=\\text { GSA }\\left(\\text { LayerNorm }\\left(\\mathbf{z}^{l}\\right)\\right)+\\mathbf{z}^{l} $$\r\n\r\n$$ \\mathbf{z}^{l+1}=\\text { FFN }\\left(\\text { LayerNorm }\\left(\\hat{\\mathbf{z}}^{l+1}\\right)\\right)+\\hat{\\mathbf{z}}^{l+1}$$\r\n\r\n$$i \\in\\{1,2, \\ldots ., m\\}, j \\in\\{1,2, \\ldots ., n\\}\r\n$$\r\n\r\nwhere LSA means locally-grouped self-attention within a sub-window; GSA is the global sub-sampled attention by interacting with the representative keys (generated by the sub-sampling functions) from each sub-window $\\hat{\\mathbf{z}}\\_{i j} \\in \\mathcal{R}^{k\\_{1} \\times k\\_{2} \\times C} .$ Both LSA and GSA have multiple heads as in the standard self-attention.",
  "title": "Twins: Revisiting the Design of Spatial Attention in Vision Transformers",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "Sparse Switchable Normalization",
  "full_name": "Sparse Switchable Normalization",
  "description": "**Sparse Switchable Normalization (SSN)** is a variant on [Switchable Normalization](https://paperswithcode.com/method/switchable-normalization) where the importance ratios are constrained to be sparse. Unlike $\\ell_1$ and $\\ell_0$ constraints that impose difficulties in optimization, the constrained optimization problem is turned into feed-forward computation through [SparseMax](https://paperswithcode.com/method/sparsemax), which is a sparse version of [softmax](https://paperswithcode.com/method/softmax).",
  "title": "SSN: Learning Sparse Switchable Normalization via SparsestMax",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "SVM",
  "full_name": "Support Vector Machine",
  "description": "A **Support Vector Machine**, or **SVM**, is a non-parametric supervised learning model. For non-linear classification and regression, they utilise the kernel trick to map inputs to high-dimensional feature spaces. SVMs construct a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier. The figure to the right shows the decision function for a linearly separable problem, with three samples on the margin boundaries, called “support vectors”. \r\n\r\nSource: [scikit-learn](https://scikit-learn.org/stable/modules/svm.html)",
  "title": null,
  "collection": "Non-Parametric Classification",
  "area": "General"
}
{
  "name": "MoNet",
  "full_name": "Mixture model network",
  "description": "Mixture model network (MoNet) is a general framework allowing to design convolutional deep architectures on non-Euclidean domains such as graphs and manifolds.\r\n\r\nImage and description from: [Geometric deep learning on graphs and manifolds using mixture model CNNs](https://arxiv.org/pdf/1611.08402.pdf)",
  "title": "Geometric deep learning on graphs and manifolds using mixture model CNNs",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "NesT",
  "full_name": "NesT",
  "description": "**NesT** stacks canonical transformer layers to conduct local self-attention on every image block independently, and then \"nests\" them hierarchically. Coupling of processed information between spatially adjacent blocks is achieved through a proposed block aggregation between every two hierarchies. The overall hierarchical structure can be determined by two key hyper-parameters: patch size $S × S$ and number of block hierarchies $T_d$. All blocks inside each hierarchy share one set of parameters. Given input of image, each image is linearly projected to an embedding. All embeddings are partitioned to blocks and flattened to generate final input. Each transformer layers is composed of a multi-head self attention (MSA) layer followed by a feed-forward fully-connected network (FFN) with skip-connection and Layer normalization. Positional embeddings are added to encode spatial information before feeding into the block. Lastly, a nested hierarchy with block aggregation is built -- every four spatially connected blocks are merged into one.",
  "title": "Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Associative LSTM",
  "full_name": "Associative LSTM",
  "description": "An **Associative LSTM** combines an [LSTM](https://paperswithcode.com/method/lstm) with ideas from Holographic Reduced Representations (HRRs) to enable key-value storage of data. HRRs use a “binding” operator to implement key-value\r\nbinding between two vectors (the key and its associated content). They natively implement associative arrays; as a byproduct, they can also easily implement stacks, queues, or lists.",
  "title": "Associative Long Short-Term Memory",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Group Normalization",
  "full_name": "Group Normalization",
  "description": "**Group Normalization** is a normalization layer that divides channels into groups and normalizes the features within each group. GN does not exploit the batch dimension, and its computation is independent of batch sizes. In the case where the group size is 1, it is equivalent to [Instance Normalization](https://paperswithcode.com/method/instance-normalization).\r\n\r\nAs motivation for the method, many classical features like SIFT and HOG had *group-wise* features and involved *group-wise normalization*. For example, a HOG vector is the outcome of several spatial cells where each cell is represented by a normalized orientation histogram.\r\n\r\nFormally, Group Normalization is defined as:\r\n\r\n$$ \\mu\\_{i} = \\frac{1}{m}\\sum\\_{k\\in\\mathcal{S}\\_{i}}x\\_{k} $$\r\n\r\n$$ \\sigma^{2}\\_{i} = \\frac{1}{m}\\sum\\_{k\\in\\mathcal{S}\\_{i}}\\left(x\\_{k}-\\mu\\_{i}\\right)^{2} $$\r\n\r\n$$ \\hat{x}\\_{i} = \\frac{x\\_{i} - \\mu\\_{i}}{\\sqrt{\\sigma^{2}\\_{i}+\\epsilon}} $$\r\n\r\nHere $x$ is the feature computed by a layer, and $i$ is an index. Formally, a Group Norm layer computes $\\mu$ and $\\sigma$ in a set $\\mathcal{S}\\_{i}$ defined as: $\\mathcal{S}\\_{i} = ${$k \\mid k\\_{N} = i\\_{N} ,\\lfloor\\frac{k\\_{C}}{C/G}\\rfloor = \\lfloor\\frac{I\\_{C}}{C/G}\\rfloor $}.\r\n\r\nHere $G$ is the number of groups, which is a pre-defined hyper-parameter ($G = 32$ by default). $C/G$ is the number of channels per group. $\\lfloor$ is the floor operation, and the final term means that the indexes $i$ and $k$ are in the same group of channels, assuming each group of channels are stored in a sequential order along the $C$ axis.",
  "title": "Group Normalization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Ape-X",
  "full_name": "Ape-X",
  "description": "**Ape-X** is a distributed architecture for deep reinforcement learning. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared [experience replay](https://paperswithcode.com/method/experience-replay) memory; the learner replays samples of experience and updates the neural network. The architecture relies on [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) to focus only on the most significant data generated by the actors.\r\n\r\nIn contrast to Gorila, Ape-X uses a shared, centralized replay memory, and instead of sampling\r\nuniformly, it prioritizes, to sample the most useful data more often. All communications are batched with the centralized replay, increasing the efficiency and throughput at the cost of some latency. \r\nAnd by learning off-policy, Ape-X has the ability to combine data from many distributed actors, by giving the different actors different exploration policies, broadening the diversity of the experience they jointly encounter.",
  "title": "Distributed Prioritized Experience Replay",
  "collection": "Distributed Reinforcement Learning",
  "area": "Reinforcement Learning"
}
{
  "name": "ZeRO",
  "full_name": "ZeRO",
  "description": "**Zero Redundancy Optimizer (ZeRO)** is a sharded data parallel method for distributed training. ZeRODP removes the memory state redundancies across data-parallel processes by partitioning the model states instead of replicating them, and it retains the compute/communication efficiency by retaining the computational granularity and communication volume of DP using a dynamic communication schedule during training.",
  "title": "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "ARShoe",
  "full_name": "ARShoe",
  "description": "**ARShoe** is a multi-branch network for pose estimation and segmentation tackling the \"try-on\" problem for augmented reality shoes. Consisting of an encoder and a decoder, the multi-branch network is trained to predict keypoints [heatmap](https://paperswithcode.com/method/heatmap) (heatmap), [PAFs](https://paperswithcode.com/method/pafs) heatmap (pafmap), and segmentation results (segmap) simultaneously. Post processes are then performed for a smooth and realistic virtual try-on.",
  "title": "ARShoe: Real-Time Augmented Reality Shoe Try-on System on Smartphones",
  "collection": "6D Pose Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "SMOT",
  "full_name": "Single-Shot Multi-Object Tracker",
  "description": "**Single-Shot Multi-Object Tracker** or **SMOT**, is a tracking framework that converts any single-shot detector (SSD) model into an online multiple object tracker, which emphasizes simultaneously detecting and tracking of the object paths. Contrary to the existing tracking by detection approaches which suffer from errors made by the object detectors, SMOT adopts the recently proposed scheme of tracking by re-detection.\r\n\r\nThe proposed SMOT consists of two stages. The first stage generates temporally consecutive tracklets by exploring the temporal and spatial correlations from previous frame. The second stage performs online linking of the tracklets to generate a face track for each person (better view in color).",
  "title": "SMOT: Single-Shot Multi Object Tracking",
  "collection": "Multi-Object Tracking Models",
  "area": "Computer Vision"
}
{
  "name": "GenSAM",
  "full_name": "Generalizable SAM",
  "description": "The Segment Anything Model (SAM) shows remarkable segmentation ability with sparse prompts like points. However, manual prompt is not always feasible, as it may not be accessible in real-world application. In this work, we aim to eliminate the need for manual prompt.The key idea is to employ Cross-modal Chains of Thought Prompting (CCTP) to reason visual prompts using the semantic information given by a generic text prompt. We introduce a test-time adaptation per-instance mechanism called Generalizable SAM (GenSAM) to automatically generate and optimize visual prompts the generic task prompt.  CCTP maps a single generic text prompt onto image-specific consensus foreground and background heatmaps using vision-language models, acquiring reliable visual prompts. Moreover, to test-time adapt the visual prompts, we further propose Progressive Mask Generation (PMG) to iteratively reweight the input image, guiding the model to focus on the targets in a coarse-to-fine manner.Crucially, all network parameters are fixed, avoiding the need for additional training.Experiments demonstrate the superiority of GenSAM. Experiments on three benchmarks demonstrate that GenSAM outperforms point supervision approaches and achieves comparable results to scribble supervision ones, solely relying on general task descriptions as prompts.",
  "title": "Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Base Boosting",
  "full_name": "Base Boosting",
  "description": "In the setting of multi-target regression, base boosting permits us to incorporate prior knowledge into the learning mechanism of gradient boosting (or Newton boosting, etc.). Namely, from the vantage of statistics, base boosting is a way of building the following additive expansion in a set of elementary basis functions:\r\n\\begin{equation}\r\nh_{j}(X ; \\{ \\alpha_{j}, \\theta_{j} \\}) = X_{j} + \\sum_{k=1}^{K_{j}} \\alpha_{j,k} b(X ; \\theta_{j,k}),\r\n\\end{equation}\r\nwhere \r\n$X$ is an example from the domain $\\mathcal{X},$\r\n$\\{\\alpha_{j}, \\theta_{j}\\} = \\{\\alpha_{j,1},\\dots, \\alpha_{j,K_{j}},\\theta_{j,1},\\dots,\\theta_{j,K_{j}}\\}$ collects the expansion coefficients and parameter sets,\r\n$X_{j}$ is the image of $X$ under the $j$th coordinate function (a prediction from a user-specified model),\r\n$K_{j}$ is the number of basis functions in the linear sum,\r\n$b(X; \\theta_{j,k})$ is a real-valued function of the example $X,$ characterized by a parameter set $\\theta_{j,k}.$\r\n\r\nThe aforementioned additive expansion differs from the \r\n[standard  additive expansion](https://projecteuclid.org/download/pdf_1/euclid.aos/1013203451):\r\n\\begin{equation}\r\nh_{j}(X ; \\{ \\alpha_{j}, \\theta_{j}\\}) = \\alpha_{j, 0} + \\sum_{k=1}^{K_{j}} \\alpha_{j,k} b(X ; \\theta_{j,k}),\r\n\\end{equation}\r\nas it replaces the constant offset value $\\alpha_{j, 0}$ with a prediction from a user-specified model. 
In essence, this modification permits us to incorporate prior knowledge into the for loop of gradient boosting, as the for loop proceeds to build the linear sum by computing residuals that depend upon predictions from the user-specified model instead of the optimal constant model: $\\mbox{argmin}_{c} \\sum_{i=1}^{m_{train}} \\ell_{j}(Y_{j}^{(i)}, c),$ where $m_{train}$ denotes the number of training examples, $\\ell_{j}$ denotes a single-target loss function, and $c \\in \\mathbb{R}$ denotes a real number, e.g., $\\mbox{argmin}_{c} \\sum_{i=1}^{m_{train}} (Y_{j}^{(i)} - c)^{2} = \\frac{\\sum_{i=1}^{m_{train}} Y_{j}^{(i)}}{m_{train}}.$",
  "title": "Boosting on the shoulders of giants in quantum device calibration",
  "collection": "Generalized Additive Models",
  "area": "General"
}
{
  "name": "SReLU",
  "full_name": "S-shaped ReLU",
  "description": "The **S-shaped Rectified Linear Unit**, or **SReLU**, is an activation function for neural networks. It learns both convex and non-convex functions, imitating the multiple function forms given by the two fundamental laws, namely  the Webner-Fechner law and the Stevens law, in psychophysics and neural sciences. Specifically, SReLU consists of three piecewise linear functions, which are formulated by four learnable parameters. \r\n\r\nThe SReLU is defined as a mapping:\r\n\r\n$$ f\\left(x\\right) = t\\_{i}^{r}  + a^{r}\\_{i}\\left(x\\_{i}-t^{r}\\_{i}\\right) \\text{ if } x\\_{i} \\geq t^{r}\\_{i} $$\r\n$$ f\\left(x\\right) = x\\_{i} \\text{ if } t^{r}\\_{i} > x > t\\_{i}^{l}$$\r\n$$ f\\left(x\\right) = t\\_{i}^{l}  + a^{l}\\_{i}\\left(x\\_{i}-t^{l}\\_{i}\\right) \\text{ if } x\\_{i} \\leq t^{l}\\_{i} $$\r\n\r\nwhere $t^{l}\\_{i}$, $t^{r}\\_{i}$ and $a^{l}\\_{i}$ are learnable parameters of the network $i$ and indicates that the SReLU can differ in different channels. The parameter $a^{r}\\_{i}$ represents the slope of the right line with input above a set threshold. $t^{r}\\_{i}$ and $t^{l}\\_{i}$ are thresholds in positive and negative directions respectively.\r\n\r\nSource: [Activation Functions](https://arxiv.org/pdf/1811.03378.pdf)",
  "title": "Deep Learning with S-shaped Rectified Linear Activation Units",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "HAPPIER",
  "full_name": "Hierarchical Average Precision training for Pertinent ImagE Retrieval",
  "description": "",
  "title": "Hierarchical Average Precision Training for Pertinent Image Retrieval",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Metropolis Hastings",
  "full_name": "Metropolis Hastings",
  "description": "**Metropolis-Hastings** is a Markov Chain Monte Carlo (MCMC) algorithm for approximate inference. It allows for sampling from a probability distribution where direct sampling is difficult - usually owing to the presence of an intractable integral.\r\n\r\nM-H consists of a proposal distribution $q\\left(\\theta^{'}\\mid\\theta\\right)$ to draw a parameter value. To decide whether $\\theta^{'}$ is accepted or rejected, we then calculate a ratio:\r\n\r\n$$ \\frac{p\\left(\\theta^{'}\\mid{D}\\right)}{p\\left(\\theta\\mid{D}\\right)} $$\r\n\r\nWe then draw a random number $r \\in \\left[0, 1\\right]$ and accept if it is under the ratio, reject otherwise. If we accept, we set $\\theta_{i} = \\theta^{'}$ and repeat.\r\n\r\nBy the end we have a sample of $\\theta$ values that we can use to form quantities over an approximate posterior, such as the expectation and uncertainty bounds. In practice, we typically have a period of tuning to achieve an acceptable acceptance ratio for the algorithm, as well as a warmup period to reduce bias towards initialization values.\r\n\r\nImage: [Samuel Hudec](https://static1.squarespace.com/static/52e69d46e4b05a145935f24d/t/5a7dbadcf9619a745c5b2513/1518189289690/Stan.pdf)",
  "title": null,
  "collection": "Markov Chain Monte Carlo",
  "area": "General"
}
{
  "name": "Tacotron 2",
  "full_name": "Tacotron2",
  "description": "**Tacotron 2** is a neural network architecture for speech synthesis directly from text. It consists of two components:\r\n\r\n- a recurrent sequence-to-sequence feature prediction network with\r\nattention which predicts a sequence of mel spectrogram frames from\r\nan input character sequence\r\n- a modified version of [WaveNet](https://paperswithcode.com/method/wavenet) which generates time-domain waveform samples conditioned on the\r\npredicted mel spectrogram frames\r\n\r\nIn contrast to the original [Tacotron](https://paperswithcode.com/method/tacotron), Tacotron 2 uses simpler building blocks, using vanilla [LSTM](https://paperswithcode.com/method/lstm) and convolutional layers in the encoder and decoder instead of [CBHG](https://paperswithcode.com/method/cbhg) stacks and [GRU](https://paperswithcode.com/method/gru) recurrent layers. Tacotron 2 does not use a “reduction factor”, i.e., each decoder step corresponds to a single spectrogram frame. Location-sensitive attention is used instead of [additive attention](https://paperswithcode.com/method/additive-attention).",
  "title": "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "srBTAW (BTW)",
  "full_name": "Self-regularizing Boundary Time and Amplitude Warping",
  "description": "",
  "title": "Technical Reports Compilation: Detecting the Fire Drill Anti-pattern Using Source Code and Issue-Tracking Data",
  "collection": "Time Series Analysis",
  "area": "Sequential"
}
{
  "name": "PMLM",
  "full_name": "Probabilistically Masked Language Model",
  "description": "**Probabilistically Masked Language Model**, or **PMLM**, is a type of language model that utilizes a probabilistic masking scheme, aiming to bridge the gap between masked and autoregressive language models. The basic idea behind the connection of two categories of models is similar to MADE by Germain et al (2015). PMLM is a masked language model with a probabilistic masking scheme, which defines the way sequences are masked by following a probabilistic distribution. The authors employ a simple uniform distribution of the masking ratio and name the model as u-PMLM.",
  "title": "Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "STA-LSTM",
  "full_name": "Spatio-Temporal Attention LSTM",
  "description": "In human action recognition, \r\neach type of action  generally only depends \r\non a few specific kinematic joints. Furthermore, over time, multiple actions may be performed.\r\nMotivated by these observations, Song et al. proposed \r\na joint spatial and temporal attention network based on LSTM, to adaptively find discriminative features and keyframes. \r\nIts main attention-related components are a spatial attention sub-network, to select important regions, and a temporal attention sub-network, to select key frames. The spatial attention sub-network can be written as:\r\n\\begin{align}\r\n    s_{t} &= U_{s}\\tanh(W_{xs}X_{t} + W_{hs}h_{t-1}^{s} + b_{si}) + b_{so}\r\n\\end{align}\r\n\\begin{align}\r\n    \\alpha_{t} &= \\text{Softmax}(s_{t})\r\n\\end{align}\r\n\\begin{align}\r\n    Y_{t} &= \\alpha_{t}  X_{t} \r\n\\end{align}\r\nwhere $X_{t}$ is the input feature at time $t$, $U_{s}$, $W_{hs}$, $b_{si}$, and $b_{so}$ are learnable parameters, and $h_{t-1}^{s}$ is the hidden state at step $t-1$. Note that use of the hidden state $h$ means  the attention process takes  temporal relationships into consideration.\r\n\r\nThe temporal attention sub-network is similar to the spatial branch and produces its attention map using:\r\n\\begin{align}\r\n    \\beta_{t} = \\delta(W_{xp}X_{t} + W_{hp}h_{t-1}^{p} + b_{p}). \r\n\\end{align}\r\nIt adopts a ReLU function instead of a normalization function for ease of optimization. It also uses a regularized objective function to improve  convergence.\r\n\r\nOverall, this paper presents a joint spatiotemporal attention method\r\nto focus on important joints and keyframes, \r\nwith excellent results on the action recognition task.",
  "title": "An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "CORAD",
  "full_name": "CORAD: Correlation-Aware Compression of Massive Time Series using Sparse Dictionary Coding",
  "description": "",
  "title": "CORAD: Correlation-Aware Compression of Massive Time Series using Sparse Dictionary Coding",
  "collection": "Model Compression",
  "area": "General"
}
{
  "name": "Robust Predictable Control",
  "full_name": "Robust Predictable Control",
  "description": "**Robust Predictable Control**, or **RPC**, is an RL algorithm for learning policies that uses only a few bits of information. RPC brings together ideas from information bottlenecks, model-based RL, and bits-back coding. The main idea of RPC is that if the agent can accurately predict the future, then the agent will not need to observe as many bits from future observations. Precisely, the agent will learn a latent dynamics model that predicts the next representation using the current representation and action. In addition to predicting the future, the agent can also decrease the number of bits by changing its behavior. States where the dynamics are hard to predict will require more bits, so the agent will prefer visiting states where its learned model can accurately predict the next state.",
  "title": "Robust Predictable Control",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "Ape-X DPG",
  "full_name": "Ape-X DPG",
  "description": "**Ape-X DPG** combines [DDPG](https://paperswithcode.com/method/ddpg) with distributed [prioritized experience replay](https://paperswithcode.com/method/prioritized-experience-replay) through the [Ape-X](https://paperswithcode.com/method/ape-x) architecture.",
  "title": "Distributed Prioritized Experience Replay",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "BezierAlign",
  "full_name": "BezierAlign",
  "description": "**BezierAlign** is a feature sampling method for arbitrarily-shaped scene text recognition that exploits parameterization nature of a compact Bezier curve bounding box.  Unlike RoIAlign, the shape of sampling grid of BezierAlign is not rectangular. Instead, each column of the arbitrarily-shaped grid is orthogonal to the Bezier curve boundary of the text. The sampling points have equidistant interval in width and height, respectively, which are bilinear interpolated with respect to the coordinates.\r\n\r\nFormally given an input feature map and Bezier curve control points, we concurrently process all the output pixels of the rectangular output feature map with size $h\\_{\\text {out }} \\times w\\_{\\text {out }}$. Taking pixel $g\\_{i}$ with position $\\left(g\\_{i w}, g\\_{i h}\\right)$ (from output feature map) as an example, we calculate $t$ by:\r\n\r\n$$\r\nt=\\frac{g\\_{i w}}{w\\_{o u t}}\r\n$$\r\n\r\nWe then calculate the point of upper Bezier curve boundary $tp$ and lower Bezier curve boundary $bp$. Using $tp$ and $bp$, we can linearly index the sampling point $op$ by:\r\n\r\n$$\r\nop=bp \\cdot \\frac{g\\_{i h}}{h\\_{\\text {out }}}+tp \\cdot\\left(1-\\frac{g\\_{i h}}{h\\_{\\text {out }}}\\right)\r\n$$\r\n\r\nWith the position of $op$, we can easily apply bilinear interpolation to calculate the result. Comparisons among previous sampling methods and BezierAlign are shown in the Figure.",
  "title": "ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "PQ-Transformer",
  "full_name": "PointQuad-Transformer",
  "description": "**PQ-Transformer**, or **PointQuad-Transformer**, is a [Transformer](https://paperswithcode.com/method/transformer)-based architecture that predicts 3D objects and layouts simultaneously, using point cloud inputs. Unlike existing methods that either estimate layout keypoints or edges, room layouts are directly parameterized as a set of quads. Along with the quad representation, a physical constraint loss function is used that discourages object-layout interference.\r\n\r\nGiven an input 3D point cloud of $N$ points, the point cloud feature learning backbone extracts $M$ context-aware point features of $\\left(3+C\\right)$ dimensions, through sampling and grouping. A voting module and a farthest point sampling (FPS) module are used to generate $K\\_{1}$ object proposals and $K\\_{2}$ quad proposals respectively. Then the proposals are processed by a transformer decoder to further refine proposal features. Through several feedforward layers and non-maximum suppression (NMS), the proposals become the final object bounding boxes and layout quads.",
  "title": "PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds",
  "collection": "Point Cloud Models",
  "area": "Computer Vision"
}
{
  "name": "VGG-19",
  "full_name": "Visual Geometry Group 19 Layer CNN",
  "description": "",
  "title": "Very Deep Convolutional Networks for Large-Scale Image Recognition",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "GTS",
  "full_name": "Goal-Driven Tree-Structured Neural Model",
  "description": "",
  "title": "A Goal-Driven Tree-Structured Neural Model for Math Word Problems",
  "collection": "Sequence To Sequence Models",
  "area": "Sequential"
}
{
  "name": "Fastformer",
  "full_name": "Fastformer",
  "description": "**Fastformer** is an type of [Transformer](https://paperswithcode.com/method/transformer) which uses [additive attention](https://www.paperswithcode.com/method/additive-attention) as a building block. Instead of modeling the pair-wise interactions between tokens, [additive attention](https://paperswithcode.com/method/additive-attention) is used to model global contexts, and then each token representation is further transformed based on its interaction with global context representations.",
  "title": "Fastformer: Additive Attention Can Be All You Need",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Rainbow DQN",
  "full_name": "Rainbow DQN",
  "description": "**Rainbow DQN** is an extended [DQN](https://paperswithcode.com/method/dqn) that combines several improvements into a single learner. Specifically:\r\n\r\n- It uses [Double Q-Learning](https://paperswithcode.com/method/double-q-learning) to tackle overestimation bias.\r\n- It uses [Prioritized Experience Replay](https://paperswithcode.com/method/prioritized-experience-replay) to prioritize important transitions.\r\n- It uses [dueling networks](https://paperswithcode.com/method/dueling-network).\r\n- It uses [multi-step learning](https://paperswithcode.com/method/n-step-returns).\r\n- It uses distributional reinforcement learning instead of the expected return.\r\n- It uses noisy linear layers for exploration.",
  "title": "Rainbow: Combining Improvements in Deep Reinforcement Learning",
  "collection": "Q-Learning Networks",
  "area": "Reinforcement Learning"
}
{
  "name": "DANet",
  "full_name": "Dual Attention Network",
  "description": "In the field of scene segmentation,\r\nencoder-decoder structures cannot make use of the global relationships \r\nbetween objects, whereas RNN-based structures \r\nheavily rely on the output of the long-term memorization.\r\nTo address the above problems, \r\nFu et al. proposed a novel framework, \r\n the dual attention network (DANet), \r\nfor natural scene image segmentation. \r\nUnlike CBAM and BAM, it adopts a self-attention mechanism \r\ninstead of simply stacking convolutions to compute the spatial attention map,\r\nwhich enables the network to capture global information directly. \r\n\r\nDANet uses in parallel a position attention module and a channel attention module to capture feature dependencies in spatial and channel domains. Given the input feature map $X$, convolution layers are applied first in the position attention module to obtain new feature maps. Then the position attention module selectively aggregates the features at each position using a weighted sum of features at all positions, where the weights are determined by feature similarity between corresponding pairs of positions. The channel attention module has a similar form except for dimensional reduction to model cross-channel relations. Finally the outputs from the two branches are fused to obtain final feature representations. For simplicity, we reshape the feature map $X$ to $C\\times (H \\times W)$ whereupon the overall process can be written as \r\n\\begin{align}\r\n    Q,\\quad K,\\quad V &= W_qX,\\quad W_kX,\\quad W_vX\r\n\\end{align}\r\n\\begin{align}\r\n    Y^\\text{pos} &=  X+ V\\text{Softmax}(Q^TK)\r\n\\end{align}\r\n\\begin{align}\r\n    Y^\\text{chn} &=  X+ \\text{Softmax}(XX^T)X \r\n\\end{align}\r\n\\begin{align}\r\n    Y &= Y^\\text{pos} + Y^\\text{chn}\r\n\\end{align}\r\nwhere $W_q$, $W_k$, $W_v \\in \\mathbb{R}^{C\\times C}$ are used to generate new feature maps.   
\r\n\r\nThe position attention module enables DANet to capture long-range contextual information and adaptively integrate similar features at any scale from a global viewpoint, while the channel attention module is responsible for enhancing useful channels as well as suppressing noise. Taking spatial and channel relationships into consideration explicitly improves the feature representation for scene segmentation. However, it is computationally costly, especially for large input feature maps.",
  "title": "Dual Attention Network for Scene Segmentation",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Mask R-CNN",
  "full_name": "Mask R-CNN",
  "description": "**Mask R-CNN** extends [Faster R-CNN](http://paperswithcode.com/method/faster-r-cnn) to solve instance segmentation tasks. It achieves this by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. In principle, Mask R-CNN is an intuitive extension of Faster [R-CNN](https://paperswithcode.com/method/r-cnn), but constructing the mask branch properly is critical for good results. \r\n\r\nMost importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is evident in how [RoIPool](http://paperswithcode.com/method/roi-pooling), the *de facto* core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, Mask R-CNN utilises a simple, quantization-free layer, called [RoIAlign](http://paperswithcode.com/method/roi-align), that faithfully preserves exact spatial locations. \r\n\r\nSecondly, Mask R-CNN *decouples* mask and class prediction: it predicts a binary mask for each class independently, without competition among classes, and relies on the network's RoI classification branch to predict the category. In contrast, an [FCN](http://paperswithcode.com/method/fcn) usually perform per-pixel multi-class categorization, which couples segmentation and classification.",
  "title": "Mask R-CNN",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "GCN",
  "full_name": "Graph Convolutional Network",
  "description": "A **Graph Convolutional Network**, or **GCN**, is an approach for semi-supervised learning on graph-structured data. It is based on an efficient variant of [convolutional neural networks](https://paperswithcode.com/methods/category/convolutional-neural-networks) which operate directly on graphs. The choice of convolutional architecture is motivated via a localized first-order approximation of spectral graph convolutions. The model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes.",
  "title": "Semi-Supervised Classification with Graph Convolutional Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Denoising Autoencoder",
  "full_name": "Denoising Autoencoder",
  "description": "A **Denoising Autoencoder** is a modification on the [autoencoder](https://paperswithcode.com/method/autoencoder) to prevent the network learning the identity function. Specifically, if the autoencoder is too big, then it can just learn the data, so the output equals the input, and does not perform any useful representation learning or dimensionality reduction. Denoising autoencoders solve this problem by corrupting the input data on purpose, adding noise or masking some of the input values.\r\n\r\nImage Credit: [Kumar et al](https://www.semanticscholar.org/paper/Static-hand-gesture-recognition-using-stacked-Kumar-Nandi/5191ddf3f0841c89ba9ee592a2f6c33e4a40d4bf)",
  "title": null,
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "m-arcsinh",
  "full_name": "modified arcsinh",
  "description": "",
  "title": "m-arcsinh: An Efficient and Reliable Function for SVM and MLP in scikit-learn",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Reliability Balancing",
  "full_name": "Reliability Balancing",
  "description": "",
  "title": "ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning",
  "collection": "Feature Matching",
  "area": "General"
}
{
  "name": "Ghost Module",
  "full_name": "Ghost Module",
  "description": "A **Ghost Module** is an image block for convolutional neural network that aims to generate more features by using fewer parameters. Specifically, an ordinary convolutional layer in deep neural networks is split into two parts. The first part involves ordinary convolutions but their total number is controlled. Given the intrinsic feature maps from the first part, a series of simple linear operations are applied for generating more feature maps. \r\n\r\nGiven the widely existing redundancy in intermediate feature maps calculated by mainstream CNNs, ghost modules aim to reduce them. In practice, given the input data $X\\in\\mathbb{R}^{c\\times h\\times w}$, where $c$ is the number of input channels and $h$ and $w$ are the height and width of the input data, respectively,  the operation of an arbitrary convolutional layer for producing $n$ feature maps can be formulated as\r\n\r\n$$\r\nY = X*f+b,\r\n$$\r\n\r\nwhere $*$ is the [convolution](https://paperswithcode.com/method/convolution) operation, $b$ is the bias term, $Y\\in\\mathbb{R}^{h'\\times w'\\times n}$ is the output feature map with $n$ channels, and $f\\in\\mathbb{R}^{c\\times k\\times k \\times n}$ is the convolution filters in this layer. In addition, $h'$ and $w'$ are the height and width of the output data, and $k\\times k$ is the kernel size of convolution filters $f$, respectively. During this convolution procedure, the required number of FLOPs can be calculated as $n\\cdot h'\\cdot w'\\cdot c\\cdot k\\cdot k$, which is often as large as hundreds of thousands since the number of filters $n$ and the channel number $c$ are generally very large (e.g. 256 or 512).\r\n\r\nHere, the number of parameters (in $f$ and $b$) to be optimized is explicitly determined by the dimensions of input and output feature maps. The output feature maps of convolutional layers often contain much redundancy, and some of them could be similar with each other. 
We point out that it is unnecessary to generate these redundant feature maps one by one with a large number of FLOPs and parameters. Suppose that the output feature maps are *ghosts* of a handful of intrinsic feature maps with some cheap transformations. These intrinsic feature maps are often of smaller size and produced by ordinary convolution filters. Specifically, $m$ intrinsic feature maps $Y'\\in\\mathbb{R}^{h'\\times w'\\times m}$ are generated using a primary convolution:\r\n\r\n$$\r\nY' = X*f',\r\n$$\r\n\r\nwhere $f'\\in\\mathbb{R}^{c\\times k\\times k \\times m}$ are the utilized filters, $m\\leq n$ and the bias term is omitted for simplicity. The hyper-parameters such as filter size, stride, and padding are the same as those in the ordinary convolution to keep the spatial size (i.e. $h'$ and $w'$) of the output feature maps consistent. To further obtain the desired $n$ feature maps, we apply a series of cheap linear operations on each intrinsic feature in $Y'$ to generate $s$ ghost features according to the following function:\r\n\r\n$$\r\ny_{ij} = \\Phi_{i,j}(y'_i),\\quad \\forall\\; i = 1,...,m,\\;\\; j = 1,...,s,\r\n$$\r\n\r\nwhere $y'\\_i$ is the $i$-th intrinsic feature map in $Y'$ and $\\Phi\\_{i,j}$ is the $j$-th (except the last one) linear operation for generating the $j$-th ghost feature map $y_{ij}$; that is to say, $y'\\_i$ can have one or more ghost feature maps $\\{y\\_{ij}\\}\\_{j=1}^{s}$. The last $\\Phi\\_{i,s}$ is the identity mapping for preserving the intrinsic feature maps. We can thus obtain $n=m\\cdot s$ feature maps $Y=[y\\_{11},y\\_{12},\\cdots,y\\_{ms}]$ as the output data of a Ghost module. Note that the linear operations $\\Phi$ operate on each channel, so their computational cost is much less than that of the ordinary convolution. In practice, there could be several different linear operations in a Ghost module, e.g. $3\\times 3$ and $5\\times5$ linear kernels, which will be analyzed in the experiment part.",
  "title": "GhostNet: More Features from Cheap Operations",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "VQSVD",
  "full_name": "VQSVD",
  "description": "**Variational Quantum Singular Value Decomposition** is a variational quantum algorithm for singular value decomposition (VQSVD). By exploiting the variational principles for singular values and the Ky Fan Theorem, a novel loss function is designed such that two quantum neural networks (or parameterized quantum circuits) could be trained to learn the singular vectors and output the corresponding singular values.",
  "title": "Variational Quantum Singular Value Decomposition",
  "collection": "Quantum Methods",
  "area": "General"
}
{
  "name": "Bottleneck Transformer",
  "full_name": "Bottleneck Transformer",
  "description": "The **Bottleneck Transformer (BoTNet) ** is an image classification model that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a [ResNet](https://paperswithcode.com/method/resnet) and no other changes, the approach improves upon baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency.",
  "title": "Bottleneck Transformers for Visual Recognition",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "Co-Correcting",
  "full_name": "Co-Correcting",
  "description": "**Co-Correcting** is a noise-tolerant deep learning framework for medical image classification based on mutual learning and annotation correction. It consists of three modules: the dual-network architecture, the curriculum learning module, and the label correction module.",
  "title": "Co-Correcting: Noise-tolerant Medical Image Classification via mutual Label Correction",
  "collection": "Medical Image Models",
  "area": "Computer Vision"
}
{
  "name": "PCA",
  "full_name": "Principal Components Analysis",
  "description": "**Principle Components Analysis (PCA)** is an unsupervised method primary used for dimensionality reduction within machine learning.  PCA is calculated via a singular value decomposition (SVD) of the design matrix, or alternatively, by calculating the covariance matrix of the data and performing eigenvalue decomposition on the covariance matrix. The results of PCA provide a low-dimensional picture of the structure of the data and the leading (uncorrelated) latent factors determining variation in the data.\r\n\r\nImage Source: [Wikipedia](https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:GaussianScatterPCA.svg)",
  "title": null,
  "collection": "Dimensionality Reduction",
  "area": "General"
}
{
  "name": "ESACL",
  "full_name": "Enhanced Seq2Seq Autoencoder via Contrastive Learning",
  "description": "**ESACL**, or **Enhanced Seq2Seq Autoencoder via Contrastive Learning**, is a denoising sequence-to-sequence (seq2seq) autoencoder via contrastive learning for abstractive text summarization. The model adopts a standard [Transformer](https://paperswithcode.com/method/transformer)-based architecture with a multilayer bi-directional encoder and an autoregressive decoder. To enhance its denoising ability, self-supervised contrastive learning is incorporated along with various sentence-level document augmentation.",
  "title": "Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Noisy Student",
  "full_name": "Noisy Student",
  "description": "**Noisy Student Training** is a semi-supervised learning approach. It extends the idea of self-training\r\nand distillation with the use of equal-or-larger student models and noise added to the student during learning. It has three main steps: \r\n\r\n1. train a teacher model on labeled images\r\n2. use the teacher to generate pseudo labels on unlabeled images\r\n3. train a student model on the combination of labeled images and pseudo labeled images. \r\n\r\nThe algorithm is iterated a few times by treating the student as a teacher to relabel the unlabeled data and training a new student.\r\n\r\nNoisy Student Training seeks to improve on self-training and distillation in two ways. First, it makes the student larger than, or at least equal to, the teacher so the student can better learn from a larger dataset. Second, it adds noise to the student so the noised student is forced to learn harder from the pseudo labels. To noise the student, it uses input noise such as [RandAugment](https://paperswithcode.com/method/randaugment) data augmentation, and model noise such as [dropout](https://paperswithcode.com/method/dropout) and [stochastic depth](https://paperswithcode.com/method/stochastic-depth) during training.",
  "title": "Self-training with Noisy Student improves ImageNet classification",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "Multi-Attention Network",
  "full_name": "Multi-Attention Network",
  "description": "",
  "title": "Multi-Attention Network for One Shot Learning",
  "collection": "Attention",
  "area": "General"
}
{
  "name": "Beta-VAE",
  "full_name": "Beta-VAE",
  "description": "**Beta-VAE** is a type of variational autoencoder that seeks to discover disentangled latent factors. It modifies [VAEs](https://paperswithcode.com/method/vae) with an adjustable hyperparameter $\\beta$ that balances latent channel capacity and independence constraints with reconstruction accuracy. The idea is to maximize the probability of generating the real data while keeping the distance between the real and estimated distributions small, under a threshold $\\epsilon$. We can use the Kuhn-Tucker conditions to write this as a single equation:\r\n\r\n$$ \\mathcal{F}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) = \\mathbb{E}\\_{q\\_{\\phi}\\left(\\mathbf{z}|\\mathbf{x}\\right)}\\left[\\log{p}\\_{\\theta}\\left(\\mathbf{x}\\mid\\mathbf{z}\\right)\\right] - \\beta\\left[D\\_{KL}\\left(\\log{q}\\_{\\theta}\\left(\\mathbf{z}\\mid\\mathbf{x}\\right)||p\\left(\\mathbf{z}\\right)\\right) - \\epsilon\\right]$$\r\n\r\nwhere the KKT multiplier $\\beta$ is the regularization coefficient that constrains the capacity of the latent channel $\\mathbf{z}$ and puts implicit independence pressure on the learnt posterior due to the isotropic nature of the Gaussian prior $p\\left(\\mathbf{z}\\right)$.\r\n\r\nWe write this again using the complementary slackness assumption to get the Beta-VAE formulation:\r\n\r\n$$ \\mathcal{F}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) \\geq  \\mathcal{L}\\left(\\theta, \\phi, \\beta; \\mathbf{x}, \\mathbf{z}\\right) = \\mathbb{E}\\_{q\\_{\\phi}\\left(\\mathbf{z}|\\mathbf{x}\\right)}\\left[\\log{p}\\_{\\theta}\\left(\\mathbf{x}\\mid\\mathbf{z}\\right)\\right] - \\beta\\{D}\\_{KL}\\left(\\log{q}\\_{\\theta}\\left(\\mathbf{z}\\mid\\mathbf{x}\\right)||p\\left(\\mathbf{z}\\right)\\right)$$",
  "title": "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "ThunderNet",
  "full_name": "ThunderNet",
  "description": "**ThunderNet** is a two-stage object detection model. The design of ThunderNet aims at the computationally expensive structures in state-of-the-art two-stage detectors. The backbone utilises a [ShuffleNetV2](https://paperswithcode.com/method/shufflenet-v2) inspired network called [SNet](https://paperswithcode.com/method/snet) designed for object detection. In the detection part, ThunderNet follows the detection head design in Light-Head [R-CNN](https://paperswithcode.com/method/r-cnn), and further compresses the [RPN](https://paperswithcode.com/method/rpn) and R-CNN subnet. To eliminate the performance degradation induced by small backbones and small feature maps, ThunderNet uses two new efficient architecture blocks, [Context Enhancement Module](https://paperswithcode.com/method/context-enhancement-module) (CEM) and [Spatial Attention Module](https://paperswithcode.com/method/spatial-attention-module) (SAM). CEM combines the feature maps from multiple scales to leverage local and global context information, while SAM uses the information learned in RPN to refine the feature distribution in RoI warping.",
  "title": "ThunderNet: Towards Real-time Generic Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "ILVR",
  "full_name": "Iterative Latent Variable Refinement",
  "description": "**Iterative Latent Variable Refinement**, or **ILVR**, is a method to guide the generative process in denoising diffusion probabilistic models (DDPMs) to generate high-quality images based on a given reference image. ILVR conditions the generation process in well-performing unconditional DDPM. Each transition in the generation process is refined utilizing a given reference image. By matching each latent variable, ILVR ensures the given condition in each transition thus enables sampling from a conditional distribution. Thus, ILVR generates high-quality images sharing desired semantics.",
  "title": "ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models",
  "collection": "Generative Training",
  "area": "Computer Vision"
}
{
  "name": "NFR",
  "full_name": "Negative Face Recognition",
  "description": "**Negative Face Recognition**, or **NFR**, is a face recognition approach that enhances the soft-biometric privacy on the template-level by representing face templates in a complementary (negative) domain. While ordinary templates characterize facial properties of an individual, negative templates describe facial properties that does not exist for this individual. This suppresses privacy-sensitive information from stored templates. Experiments are conducted on two publicly available datasets captured under controlled and uncontrolled scenarios on three privacy-sensitive attributes.",
  "title": "Unsupervised Enhancement of Soft-biometric Privacy with Negative Face Recognition",
  "collection": "Face Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "FFMv2",
  "full_name": "Feature Fusion Module v2",
  "description": "**Feature Fusion Module v2** is a feature fusion module from the [M2Det](https://paperswithcode.com/method/m2det) object detection model, and is crucial for constructing the final multi-level feature pyramid. They use [1x1 convolution](https://paperswithcode.com/method/1x1-convolution) layers to compress the channels of the input features and use a concatenation operation to aggregate these feature map. FFMv2 takes the base feature and the largest output feature map of the previous [Thinned U-Shape Module](https://paperswithcode.com/method/tum) (TUM) – these two are of the same scale – as input, and produces the fused feature for the next TUM.",
  "title": "M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Selective Kernel Convolution",
  "full_name": "Selective Kernel Convolution",
  "description": "A **Selective Kernel Convolution** is a [convolution](https://paperswithcode.com/method/convolution) that enables neurons to adaptively adjust their RF sizes among multiple kernels with different kernel sizes. Specifically, the SK convolution has three operators – Split, Fuse and Select. Multiple branches with different kernel sizes are fused using\r\n[softmax](https://paperswithcode.com/method/softmax) attention that is guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer",
  "title": "Selective Kernel Networks",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "CaiT",
  "full_name": "Class-Attention in Image Transformers",
  "description": "**CaiT**, or **Class-Attention in Image Transformers**, is a type of [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) with several design alterations upon the original [ViT](https://paperswithcode.com/method/vision-transformer). First a new layer scaling approach called [LayerScale](https://paperswithcode.com/method/layerscale) is used, adding a learnable diagonal matrix on output of each residual block, initialized close to (but not at) 0, which improves the training dynamics. Secondly, [class-attention layers](https://paperswithcode.com/method/ca) are introduced to the architecture. This creates an architecture where the transformer layers involving [self-attention](https://paperswithcode.com/method/scaled) between patches are explicitly separated from class-attention layers -- that are devoted to extract the content of the processed patches into a single vector so that it can be fed to a linear classifier.",
  "title": "Going deeper with Image Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "PSPNet",
  "full_name": "PSPNet",
  "description": "**PSPNet**, or **Pyramid Scene Parsing Network**, is a semantic segmentation model that utilises a pyramid parsing module that exploits global context information by different-region based context aggregation. The local and global clues together make the final prediction more reliable. We also propose an optimization\r\n\r\nGiven an input image, PSPNet use a pretrained CNN with the dilated network strategy to extract the feature map. The final feature map size is $1/8$ of the input image. On top of the map, we use the [pyramid pooling module](https://paperswithcode.com/method/pyramid-pooling-module) to gather context information. Using our 4-level pyramid, the pooling kernels cover the whole, half of, and small portions of the image. They are fused as the global prior.\r\nThen we concatenate the prior with the original feature map in the final part of. It is followed by a [convolution](https://paperswithcode.com/method/convolution) layer to generate the final prediction map.",
  "title": "Pyramid Scene Parsing Network",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "mBART",
  "full_name": "mBART",
  "description": "**mBART** is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the [BART objective](https://paperswithcode.com/method/bart). The input texts are noised by masking phrases and permuting sentences, and a single [Transformer model](https://paperswithcode.com/method/transformer) is learned to recover the texts. Different from other pre-training approaches for machine translation, mBART pre-trains a complete autoregressive [Seq2Seq](https://paperswithcode.com/method/seq2seq) model. mBART is trained once for all languages, providing a set of parameters that can be fine-tuned for any of the language pairs in both supervised and unsupervised settings, without any task-specific or language-specific modifications or initialization schemes.",
  "title": "Multilingual Denoising Pre-training for Neural Machine Translation",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "TanhExp",
  "full_name": "Tanh Exponential Activation Function",
  "description": "Lightweight or mobile neural networks used for real-time computer vision tasks contain fewer parameters than normal\r\nnetworks, which lead to a constrained performance. In this work, we proposed a novel activation function named Tanh Exponential\r\nActivation Function (TanhExp) which can improve the performance for these networks on image classification task significantly.\r\nThe definition of TanhExp is $f(x) = x tanh(e^x)$. We demonstrate the simplicity, efficiency, and robustness of TanhExp on various\r\ndatasets and network models and TanhExp outperforms its counterparts in both convergence speed and accuracy. Its behaviour\r\nalso remains stable even with noise added and dataset altered. We show that without increasing the size of the network, the\r\ncapacity of lightweight neural networks can be enhanced by TanhExp with only a few training epochs and no extra parameters\r\nadded.",
  "title": "TanhExp: A Smooth Activation Function with High Convergence Speed for Lightweight Neural Networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Auxiliary Batch Normalization",
  "full_name": "Auxiliary Batch Normalization",
  "description": "**Auxiliary Batch Normalization** is a type of regularization used in adversarial training schemes. The idea is that adversarial examples should have a separate [batch normalization](https://paperswithcode.com/method/batch-normalization) components to the clean examples, as they have different underlying statistics.",
  "title": "Adversarial Examples Improve Image Recognition",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "DTW",
  "full_name": "Dynamic Time Warping",
  "description": "Dynamic Time Warping (DTW) [1] is one of well-known distance measures between a pairwise of time series. The main idea of DTW is to compute the distance from the matching of similar elements between time series. It uses the dynamic programming technique to find the optimal temporal matching between elements of two time series.\r\n\r\nFor instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data — indeed, any data that can be turned into a linear sequence can be analyzed with DTW. A well known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching application.\r\n\r\nIn general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restriction and rules:\r\n\r\n1. Every index from the first sequence must be matched with one or more indices from the other sequence, and vice versa\r\n2. The first index from the first sequence must be matched with the first index from the other sequence (but it does not have to be its only match)\r\n3. The last index from the first sequence must be matched with the last index from the other sequence (but it does not have to be its only match)\r\n4. The mapping of the indices from the first sequence to indices from the other sequence must be monotonically increasing, and vice versa, i.e. if j>i  are indices from the first sequence, then there must not be two indices l>k in the other sequence, such that index i is matched with index l and index j is matched with index k, and vice versa.\r\n\r\n[1] Sakoe, Hiroaki, and Seibi Chiba. 
\"Dynamic programming algorithm optimization for spoken word recognition.\" IEEE transactions on acoustics, speech, and signal processing 26, no. 1 (1978): 43-49.",
  "title": null,
  "collection": "Time Series Analysis",
  "area": "Sequential"
}
{
  "name": "Fast Sample Re-Weighting",
  "full_name": "Fast Sample Re-Weighting",
  "description": "**Fast Sample Re-Weighting**, or **FSR**, is a sample re-weighting strategy to tackle problems such as dataset biases, noisy labels and imbalanced classes. It leverages a dictionary (essentially an extra buffer) to monitor the training history reflected by the model updates during meta optimization periodically, and utilises a valuation function to discover meaningful samples from training data as the proxy of reward data. The unbiased dictionary keeps being updated and provides reward signals to optimize sample weights. Additionally, instead of maintaining model states for both model and sample weight updates separately, feature sharing is enabled for saving the computation cost used for maintaining respective states.",
  "title": "Learning Fast Sample Re-weighting Without Reward Data",
  "collection": "Sample Re-Weighting",
  "area": "General"
}
{
  "name": "StyleGAN2",
  "full_name": "StyleGAN2",
  "description": "**StyleGAN2** is a generative adversarial network that builds on [StyleGAN](https://paperswithcode.com/method/stylegan) with several improvements. First, [adaptive instance normalization](https://paperswithcode.com/method/adaptive-instance-normalization) is redesigned and replaced with a normalization technique called [weight demodulation](https://paperswithcode.com/method/weight-demodulation). Secondly, an improved training scheme upon progressively growing is introduced, which achieves the same goal - training starts by focusing on low-resolution images and then progressively shifts focus to higher and higher resolutions - without changing the network topology during training. Additionally, new types of regularization like lazy regularization and [path length regularization](https://paperswithcode.com/method/path-length-regularization) are proposed.",
  "title": "Analyzing and Improving the Image Quality of StyleGAN",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "REM",
  "full_name": "Random Ensemble Mixture",
  "description": "Random Ensemble Mixture (REM) is an easy to implement extension of [DQN](https://paperswithcode.com/method/dqn) inspired by [Dropout](https://paperswithcode.com/method/dropout). The key intuition behind REM is that if one has access to multiple estimates of Q-values, then a weighted combination of the Q-value estimates is also an estimate for Q-values. Accordingly, in each training step, REM randomly combines multiple Q-value estimates and uses this random combination for robust training.",
  "title": "An Optimistic Perspective on Offline Reinforcement Learning",
  "collection": "Q-Learning Networks",
  "area": "Reinforcement Learning"
}
{
  "name": "GEE",
  "full_name": "Generative Emotion Estimator",
  "description": "",
  "title": "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "CSPDarknet53",
  "full_name": "CSPDarknet53",
  "description": "**CSPDarknet53** is a convolutional neural network and backbone for object detection that uses [DarkNet-53](https://paperswithcode.com/method/darknet-53). It employs a CSPNet strategy to partition the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network. \r\n\r\nThis CNN is used as the backbone for [YOLOv4](https://paperswithcode.com/method/yolov4).",
  "title": "YOLOv4: Optimal Speed and Accuracy of Object Detection",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Local Contrast Normalization",
  "full_name": "Local Contrast Normalization",
  "description": "**Local Contrast Normalization** is a type of normalization that performs local subtraction and division normalizations, enforcing a sort of local competition between adjacent features in a feature map, and between features at the same spatial location in different feature maps.",
  "title": null,
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "BYOL",
  "full_name": "Bootstrap Your Own Latent",
  "description": "BYOL (Bootstrap Your Own Latent) is a new approach to self-supervised learning. BYOL’s goal is to learn a representation $y_θ$ which can then be used for downstream tasks. BYOL uses two neural networks to learn: the online and target networks. The online network is defined by a set of weights $θ$ and is comprised of three stages: an encoder $f_θ$, a projector $g_θ$ and a predictor $q_θ$. The target network has the same architecture\r\nas the online network, but uses a different set of weights $ξ$. The target network provides the regression\r\ntargets to train the online network, and its parameters $ξ$ are an exponential moving average of the\r\nonline parameters $θ$.\r\n\r\nGiven the architecture diagram on the right, BYOL minimizes a similarity loss between $q_θ(z_θ)$ and $sg(z'{_ξ})$, where $θ$ are the trained weights, $ξ$ are an exponential moving average of $θ$ and $sg$ means stop-gradient. At the end of training, everything but $f_θ$ is discarded, and $y_θ$ is used as the image representation.\r\n\r\nSource: [Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning](https://paperswithcode.com/paper/bootstrap-your-own-latent-a-new-approach-to-1)\r\n\r\nImage credit: [Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning](https://paperswithcode.com/paper/bootstrap-your-own-latent-a-new-approach-to-1)",
  "title": "Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "OHEM",
  "full_name": "Online Hard Example Mining",
  "description": "Some object detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Automatic selection of these hard examples can make training more\r\neffective and efficient. **OHEM**, or **Online Hard Example Mining**, is a bootstrapping technique that modifies [SGD](https://paperswithcode.com/method/sgd) to sample from examples in a non-uniform way depending on the current loss of each example under consideration. The method takes advantage of detection-specific problem structure in which each SGD mini-batch consists of only one or two images, but thousands of candidate examples. The candidate examples are subsampled according to a distribution\r\nthat favors diverse, high loss instances.",
  "title": "Training Region-based Object Detectors with Online Hard Example Mining",
  "collection": "Prioritized Sampling",
  "area": "General"
}
{
  "name": "Spatial Broadcast Decoder",
  "full_name": "Spatial Broadcast Decoder",
  "description": "Spatial Broadcast Decoder is an architecture that aims to improve disentangling, reconstruction accuracy, and generalization to held-out regions in data space. It provides a particularly dramatic\r\nbenefit when applied to datasets with small objects.\r\n\r\nSource: [Watters et al.](https://arxiv.org/pdf/1901.07017v2.pdf)\r\n\r\nImage source: [Watters et al.](https://arxiv.org/pdf/1901.07017v2.pdf)",
  "title": "Spatial Broadcast Decoder: A Simple Architecture for Learning Disentangled Representations in VAEs",
  "collection": "Backbone Architectures",
  "area": "Computer Vision"
}
{
  "name": "Topographic VAE",
  "full_name": "Topographic VAE",
  "description": "**Topographic VAE** is a method for efficiently training deep generative models with topographically organized latent variables. The model learns sets of approximately equivariant features (i.e. \"capsules\") directly from sequences and achieves higher likelihood on correspondingly transforming test sequences. The combined color/rotation transformation in input space $\\tau\\_{g}$ becomes encoded as a $\\mathrm{Roll}$ within the capsule dimension. The model is thus able decode unseen sequence elements by encoding a partial sequence and Rolling activations within the capsules. This resembles a commutative diagram.",
  "title": "Topographic VAEs learn Equivariant Capsules",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "V-trace",
  "full_name": "V-trace",
  "description": "**V-trace** is an off-policy actor-critic reinforcement learning algorithm that helps tackle the lag between when actions are generated by the actors and when the learner estimates the gradient. Consider a trajectory $\\left(x\\_{t}, a\\_{t}, r\\_{t}\\right)^{t=s+n}\\_{t=s}$ generated by the actor following some policy $\\mu$. We can define the $n$-steps V-trace target for $V\\left(x\\_{s}\\right)$, our value approximation at state $x\\_{s}$ as:\r\n\r\n$$ v\\_{s} = V\\left(x\\_{s}\\right) + \\sum^{s+n-1}\\_{t=s}\\gamma^{t-s}\\left(\\prod^{t-1}\\_{i=s}c\\_{i}\\right)\\delta\\_{t}V $$\r\n\r\nWhere $\\delta\\_{t}V = \\rho\\_{t}\\left(r\\_{t} + \\gamma{V}\\left(x\\_{t+1}\\right) - V\\left(x\\_{t}\\right)\\right)$ is a temporal difference algorithm for $V$, and $\\rho\\_{t} = \\text{min}\\left(\\bar{\\rho}, \\frac{\\pi\\left(a\\_{t}\\mid{x\\_{t}}\\right)}{\\mu\\left(a\\_{t}\\mid{x\\_{t}}\\right)}\\right)$ and $c\\_{i} = \\text{min}\\left(\\bar{c}, \\frac{\\pi\\left(a\\_{t}\\mid{x\\_{t}}\\right)}{\\mu\\left(a\\_{t}\\mid{x\\_{t}}\\right)}\\right)$ are truncated importance sampling weights. We assume that the truncation levels are such that $\\bar{\\rho} \\geq \\bar{c}$.",
  "title": "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures",
  "collection": "Value Function Estimation",
  "area": "Reinforcement Learning"
}
{
  "name": "BPE",
  "full_name": "Byte Pair Encoding",
  "description": "**Byte Pair Encoding**, or **BPE**, is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units. The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).\r\n\r\n[Lei Mao](https://leimao.github.io/blog/Byte-Pair-Encoding/) has a detailed blog post that explains how this works.",
  "title": "Neural Machine Translation of Rare Words with Subword Units",
  "collection": "Subword Segmentation",
  "area": "Natural Language Processing"
}
{
  "name": "SCST",
  "full_name": "Self-critical Sequence Training",
  "description": "",
  "title": "Self-critical Sequence Training for Image Captioning",
  "collection": "Reinforcement Learning Frameworks",
  "area": "Reinforcement Learning"
}
{
  "name": "GPT-3",
  "full_name": "GPT-3",
  "description": "**GPT-3** is an autoregressive [transformer](https://paperswithcode.com/methods/category/transformers)  model with 175 billion\r\nparameters. It uses the same architecture/model as [GPT-2](https://paperswithcode.com/method/gpt-2), including the modified initialization, pre-normalization, and reversible tokenization, with the exception that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the [transformer](https://paperswithcode.com/method/transformer), similar to the [Sparse Transformer](https://paperswithcode.com/method/sparse-transformer).",
  "title": "Language Models are Few-Shot Learners",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Fractal Block",
  "full_name": "Fractal Block",
  "description": "A **Fractal Block** is an image model block that utilizes an expansion rule that yields a structural layout of truncated fractals. For the base case where $f\\_{1}\\left(z\\right) = \\text{conv}\\left(z\\right)$ is a convolutional layer, we then have recursive fractals of the form:\r\n\r\n$$ f\\_{C+1}\\left(z\\right) = \\left[\\left(f\\_{C}\\circ{f\\_{C}}\\right)\\left(z\\right)\\right] \\oplus \\left[\\text{conv}\\left(z\\right)\\right]$$\r\n\r\nWhere $C$ is the number of columns. For the join layer (green in Figure), we use the element-wise mean rather than concatenation or addition.",
  "title": "FractalNet: Ultra-Deep Neural Networks without Residuals",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Low Rank Tensor Learning Paradigms",
  "full_name": "Time-homogenuous Top-K Ranking",
  "description": "Please enter a description about the method here",
  "title": null,
  "collection": "Time Series Analysis",
  "area": "Sequential"
}
{
  "name": "SimCLRv2",
  "full_name": "SimCLRv2",
  "description": "**SimCLRv2** is a semi-supervised learning method for learning from few labeled examples while making best use of a large amount of unlabeled data. It is a modification of a recently proposed contrastive learning framework, [SimCLR](https://www.paperswithcode.com/method/simclr). It improves upon it in three major ways:\r\n\r\n1. To fully leverage the power of general pre-training, larger [ResNet](https://paperswithcode.com/method/resnet) models are explored. Unlike SimCLR and other previous work, whose largest model is ResNet-50 (4×), SimCLRv2 trains models that are deeper but less wide. The largest model trained is a 152 layer ResNet with 3× wider channels and [selective kernels](https://paperswithcode.com/method/selective-kernel-convolution) (SK), a channel-wise attention mechanism that improves the parameter efficiency of the network. By scaling up the model from ResNet-50 to ResNet-152 (3×+SK), a 29% relative improvement is obtained in top-1 accuracy when fine-tuned on 1% of labeled examples.\r\n\r\n2. The capacity of the non-linear network $g(·)$ (a.k.a. projection head) is increased, by making it deeper. Furthermore, instead of throwing away $g(·)$ entirely after pre-training as in SimCLR, fine-tuning occurs from a middle layer. This small change yields a significant improvement for both linear evaluation and fine-tuning with only a few labeled examples. Compared to SimCLR with 2-layer projection head, by using a 3-layer projection head and fine-tuning from the 1st layer of projection head, it results in as much as 14% relative improvement in top-1 accuracy when fine-tuned on 1% of labeled examples.\r\n\r\n3. The memory mechanism of [MoCo v2](https://paperswithcode.com/method/moco-v2) is incorporated, which designates a memory network (with a moving average of weights for stabilization) whose output will be buffered as negative examples. 
Since training is based on large mini-batch which already supplies many contrasting negative examples, this change yields an improvement of ∼1% for linear evaluation as well as when fine-tuning on 1% of labeled examples.",
  "title": "Big Self-Supervised Models are Strong Semi-Supervised Learners",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "TransferQA",
  "full_name": "TransferQA",
  "description": "**TransferQA** is a transferable generative QA model, built upon [T5](https://paperswithcode.com/method/t5) that combines extractive QA and multi-choice QA via a text-to-text [transformer](https://paperswithcode.com/method/transformer) framework, and tracks both categorical slots and non-categorical slots in DST. In addition, it introduces two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which enable the model to handle “none” value slots in the zero-shot DST setting.",
  "title": "Zero-Shot Dialogue State Tracking via Cross-Task Transfer",
  "collection": "Question Answering Models",
  "area": "Natural Language Processing"
}
{
  "name": "BIDeN",
  "full_name": "Blind Image Decomposition Network",
  "description": "**BIDeN**, or **Blind Image Decomposition Network**, is a model for blind image decomposition, which requires separating a superimposed image into constituent underlying images in a blind setting, that is, both the source components involved in mixing as well as the mixing mechanism are unknown.  For example, rain may consist of multiple components, such as rain streaks, raindrops, snow, and haze. \r\n\r\nThe Figure shows an example where $N = 4, L = 2, x = {a, b, c, d}$, and $I = {1, 3}$. $a, c$ are selected then passed to the mixing function $f$, and outputs the mixed input image $z$, which is $f\\left(a, c\\right)$ here. The generator consists of an encoder $E$ with three branches and multiple heads $H$. $\\bigotimes$ denotes the concatenation operation. Depth and receptive field of each branch is different to capture multiple scales of features. Each specified head points to the corresponding source component, and the number of heads varies with the maximum number of source components N. All reconstructed images $\\left(a', c'\\right)$ and their corresponding real images $\\left(a, c\\right)$ are sent to an unconditional discriminator. The discriminator also predicts the source components of the input image $z$. The outputs from other heads $\\left(b', d'\\right)$ do not contribute to the optimization.",
  "title": "Blind Image Decomposition",
  "collection": "Image Decomposition Models",
  "area": "Computer Vision"
}
{
  "name": "Switchable Normalization",
  "full_name": "Switchable Normalization",
  "description": "**Switchable Normalization** combines three types of statistics estimated channel-wise, layer-wise, and minibatch-wise by using [instance normalization](https://paperswithcode.com/method/instance-normalization), [layer normalization](https://paperswithcode.com/method/layer-normalization), and [batch normalization](https://paperswithcode.com/method/batch-normalization) respectively. [Switchable Normalization](https://paperswithcode.com/method/switchable-normalization) switches among them by learning their importance weights.",
  "title": "Differentiable Learning-to-Normalize via Switchable Normalization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "3D SA",
  "full_name": "3 Dimensional Soft Attention",
  "description": "",
  "title": "Attention based Writer Independent Handwriting Verification",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Multi-Head Linear Attention",
  "full_name": "Multi-Head Linear Attention",
  "description": "**Multi-Head Linear Attention** is a type of linear multi-head self-attention module, proposed with the [Linformer](https://paperswithcode.com/method/linformer) architecture. The main idea is to add two linear projection matrices $E\\_{i}, F\\_{i} \\in \\mathbb{R}^{n\\times{k}}$ when computing key and value. We first project the original $\\left(n \\times d\\right)$-dimensional key and value layers $KW\\_{i}^{K}$ and $VW\\_{i}^{V}$ into $\\left(k\\times{d}\\right)$-dimensional projected key and value layers. We then compute a $\\left(n\\times{k}\\right)$ dimensional context mapping $\\bar{P}$ using scaled-dot product attention:\r\n\r\n$$ \\bar{\\text{head}\\_{i}} = \\text{Attention}\\left(QW^{Q}\\_{i}, E\\_{i}KW\\_{i}^{K}, F\\_{i}VW\\_{i}^{V}\\right) $$\r\n\r\n$$ \\bar{\\text{head}\\_{i}} = \\text{softmax}\\left(\\frac{QW^{Q}\\_{i}\\left(E\\_{i}KW\\_{i}^{K}\\right)^{T}}{\\sqrt{d\\_{k}}}\\right) \\cdot F\\_{i}VW\\_{i}^{V} $$\r\n\r\nFinally, we compute context embeddings for each head using $\\bar{P} \\cdot \\left(F\\_{i}{V}W\\_{i}^{V}\\right)$.",
  "title": "Linformer: Self-Attention with Linear Complexity",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "SqueezeBERT",
  "full_name": "SqueezeBERT",
  "description": "**SqueezeBERT** is an efficient architectural variant of [BERT](https://paperswithcode.com/method/bert) for natural language processing that uses [grouped convolutions](https://paperswithcode.com/method/grouped-convolution). It is much like BERT-base, but with positional feedforward connection layers implemented as convolutions, and grouped [convolution](https://paperswithcode.com/method/convolution) for many of the layers.",
  "title": "SqueezeBERT: What can computer vision teach NLP about efficient neural networks?",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Bottleneck Residual Block",
  "full_name": "Bottleneck Residual Block",
  "description": "A **Bottleneck Residual Block** is a variant of the [residual block](https://paperswithcode.com/method/residual-block) that utilises 1x1 convolutions to create a bottleneck. The use of a bottleneck reduces the number of parameters and matrix multiplications. The idea is to make residual blocks as thin as possible to increase depth and have less parameters. They were introduced as part of the [ResNet](https://paperswithcode.com/method/resnet) architecture, and are used as part of deeper ResNets such as ResNet-50 and ResNet-101.",
  "title": "Deep Residual Learning for Image Recognition",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "LeViT Attention Block",
  "full_name": "LeViT Attention Block",
  "description": "**LeViT Attention Block** is a module used for [attention](https://paperswithcode.com/methods/category/attention-mechanisms) in the [LeViT](https://paperswithcode.com/method/levit) architecture. Its main feature is providing positional information within each attention block, i.e. where we explicitly inject relative position information in the attention mechanism. This is achieved by adding an attention bias to the attention maps.",
  "title": "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "ATSS",
  "full_name": "Adaptive Training Sample Selection",
  "description": "**Adaptive Training Sample Selection**, or **ATSS**, is a method to automatically select positive and negative samples according to statistical characteristics of object. It bridges the gap between anchor-based and anchor-free detectors. \r\n\r\nFor each ground-truth box $g$ on the image, we first find out its candidate positive samples. As described in Line $3$ to $6$, on each pyramid level, we select $k$ anchor boxes whose center are closest to the center of $g$ based on L2 distance. Supposing there are $\\mathcal{L}$ feature pyramid levels, the ground-truth box $g$ will have $k\\times\\mathcal{L}$ candidate positive samples. After that, we compute the IoU between these candidates and the ground-truth $g$ as $\\mathcal{D}_g$ in Line $7$, whose mean and standard deviation are computed as $m_g$ and $v_g$ in Line $8$ and Line $9$. With these statistics, the IoU threshold for this ground-truth $g$ is obtained as $t_g=m_g+v_g$ in Line $10$. Finally, we select these candidates whose IoU are greater than or equal to the threshold $t_g$ as final positive samples in Line $11$ to $15$. \r\n\r\nNotably ATSS also limits the positive samples' center to the ground-truth box as shown in Line $12$. Besides, if an anchor box is assigned to multiple ground-truth boxes, the one with the highest IoU will be selected. The rest are negative samples.",
  "title": "Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection",
  "collection": "Prioritized Sampling",
  "area": "General"
}
{
  "name": "Discriminative Regularization",
  "full_name": "Discriminative Regularization",
  "description": "**Discriminative Regularization** is a regularization technique for [variational autoencoders](https://paperswithcode.com/methods/category/likelihood-based-generative-models) that uses representations from discriminative classifiers to augment the [VAE](https://paperswithcode.com/method/vae) objective function (the lower bound) corresponding to a generative model. Specifically, it encourages the model’s reconstructions to be close to the data example in a representation space defined by the hidden layers of highly-discriminative, neural network based classifiers.",
  "title": "Discriminative Regularization for Generative Models",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "StyleMapGAN",
  "full_name": "StyleMapGAN",
  "description": "**StyleMapGAN** is a generative adversarial network for real-time image editing. The intermediate latent space has spatial dimensions, and a spatially variant modulation replaces [AdaIN](https://paperswithcode.com/method/adaptive-instance-normalization). It aims to make the embedding through an encoder more accurate than existing optimization-based methods while maintaining the properties of GANs.",
  "title": "Exploiting Spatial Dimensions of Latent in GAN for Real-time Image Editing",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "SPIN",
  "full_name": "Spatiotemporal Point Inference Network",
  "description": "",
  "title": "Learning to Reconstruct Missing Data from Spatiotemporal Graphs with Sparse Observations",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "GraphCL",
  "full_name": "Graph contrastive learning with augmentations",
  "description": "",
  "title": "Graph Contrastive Learning with Augmentations",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "GECO",
  "full_name": "Generalized ELBO with Constrained Optimization",
  "description": "",
  "title": "Taming VAEs",
  "collection": "Variational Optimization",
  "area": "General"
}
{
  "name": "DV3 Convolution Block",
  "full_name": "DV3 Convolution Block",
  "description": "**DV3 Convolution Block** is a convolutional block used for the [Deep Voice 3](https://paperswithcode.com/method/deep-voice-3) text-to-speech architecture. It consists of a 1-D [convolution](https://paperswithcode.com/method/convolution) with a gated linear unit and a [residual connection](https://paperswithcode.com/method/residual-connection). In the Figure, $c$ denotes the dimensionality of the input. The convolution output of size $2 \\cdot c$ is split into equal-sized portions: the gate vector and the input vector. A scaling factor $\\sqrt{0.5}$ is used to ensure that we preserve the input variance early in training. The gated linear unit provides a linear path for the gradient flow, which alleviates the vanishing gradient issue for stacked convolution blocks while retaining non-linearity. To introduce speaker-dependent control, a speaker-dependent embedding is added as a bias to the convolution filter output, after a softsign function. The authors use the softsign nonlinearity because it limits the range of the output while also avoiding the saturation problem that exponential based nonlinearities sometimes exhibit. Convolution filter weights are initialized with zero-mean and unit-variance activations throughout the entire network.",
  "title": "Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "CHM",
  "full_name": "Convolutional Hough Matching",
  "description": "**Convolutional Hough Matching**, or **CHM**, is a geometric matching algorithm that distributes similarities of candidate matches over a geometric transformation space and evaluates them in a convolutional manner. It is casted into a trainable neural layer with a  semi-isotropic high-dimensional kernel, which learns non-rigid matching with a small number of interpretable parameters.",
  "title": "Convolutional Hough Matching Networks for Robust and Efficient Visual Correspondence",
  "collection": "Geometric Matching",
  "area": "General"
}
{
  "name": "Glow-TTS",
  "full_name": "Glow-TTS",
  "description": "**Glow-TTS** is a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech.  The model is directly trained to maximize the log-likelihood of speech with the alignment. Enforcing hard monotonic alignments helps enable robust TTS, which generalizes to long utterances, and employing flows enables fast, diverse, and controllable speech synthesis.",
  "title": "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "CT3D",
  "full_name": "CT3D",
  "description": "**CT3D** is a two-stage 3D object detection framework that leverages a high-quality region proposal network and a Channel-wise [Transformer](https://paperswithcode.com/method/transformer) architecture. The proposed CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation for the point features within each proposal. Specifically, CT3D uses a proposal's keypoints for spatial contextual modelling and learns attention propagation in the encoding module, mapping the proposal to point embeddings. Next, a new channel-wise decoding module enriches the query-key interaction via channel-wise re-weighting to effectively merge multi-level contexts, which contributes to more accurate object predictions. \r\n\r\nIn CT3D, the raw points are first fed into the [RPN](https://paperswithcode.com/method/rpn) for generating 3D proposals. Then the raw points along with the corresponding proposals are processed by the channel-wise Transformer composed of the proposal-to-point encoding module and the channel-wise decoding module. Specifically, the proposal-to-point encoding module is to modulate each point feature with global proposal-aware context information. After that, the encoded point features are transformed into an effective proposal feature representation by the\r\nchannel-wise decoding module for confidence prediction and box regression.",
  "title": "Improving 3D Object Detection with Channel-wise Transformer",
  "collection": "3D Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "DeLighT Block",
  "full_name": "DeLighT Block",
  "description": "A **DeLighT Block** is a block used in the [DeLighT](https://paperswithcode.com/method/delight) [transformer](https://paperswithcode.com/method/transformer) architecture. It uses a [DExTra](https://paperswithcode.com/method/dextra) transformation to reduce the dimensionality of the vectors entered into the attention layer, where a [single-headed attention](https://paperswithcode.com/method/single-headed-attention) module is used.  Since the DeLighT block learns wider representations of the input across different layers using DExTra, it enables the authors to replace [multi-head attention](https://paperswithcode.com/method/multi-head-attention) with single-head attention. This is then followed by a light-weight FFN which, rather than expanding the dimension (as in normal Transformers which widen to a dimension 4x the size), imposes a bottleneck and squeezes the dimensions. Again, the reason for this is that the DExTra transformation has already incorporated wider representations so we can squeeze instead at this layer.",
  "title": "DeLighT: Deep and Light-weight Transformer",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "Embedded Gaussian Affinity",
  "full_name": "Embedded Gaussian Affinity",
  "description": "**Embedded Gaussian Affinity** is a type of affinity or self-similarity function between two points $\\mathbf{x\\_{i}}$ and $\\mathbf{x\\_{j}}$ that uses a Gaussian function in an embedding space:\r\n\r\n$$ f\\left(\\mathbf{x\\_{i}}, \\mathbf{x\\_{j}}\\right) = e^{\\theta\\left(\\mathbf{x\\_{i}}\\right)^{T}\\phi\\left(\\mathbf{x\\_{j}}\\right)} $$\r\n\r\nHere $\\theta\\left(x\\_{i}\\right) = W\\_{θ}x\\_{i}$ and $\\phi\\left(x\\_{j}\\right) = W\\_{φ}x\\_{j}$ are two embeddings.\r\n\r\nNote that the self-attention module used in the original [Transformer](https://paperswithcode.com/method/transformer) model is a special case of non-local operations in the embedded Gaussian version. This can be seen from the fact that for a given $i$, $\\frac{1}{\\mathcal{C}\\left(\\mathbf{x}\\right)}\\sum\\_{\\forall{j}}f\\left(\\mathbf{x}\\_{i}, \\mathbf{x}\\_{j}\\right)g\\left(\\mathbf{x}\\_{j}\\right)$ becomes the [softmax](https://paperswithcode.com/method/softmax) computation along the dimension $j$. So we have $\\mathbf{y} = \\text{softmax}\\left(\\mathbf{x}^{T}W^{T}\\_{\\theta}W\\_{\\phi}\\mathbf{x}\\right)g\\left(\\mathbf{x}\\right)$, which is the self-attention form in the Transformer model. This shows how we can relate this recent self-attention model to the classic computer vision method of non-local means.",
  "title": "Non-local Neural Networks",
  "collection": "Affinity Functions",
  "area": "General"
}
{
  "name": "DVD-GAN GBlock",
  "full_name": "DVD-GAN GBlock",
  "description": "**DVD-GAN GBlock** is a [residual block](https://paperswithcode.com/method/residual-block) for the generator used in the [DVD-GAN](https://paperswithcode.com/method/dvd-gan) architecture for video generation.",
  "title": "Adversarial Video Generation on Complex Datasets",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "DAPO",
  "full_name": "Dialogue-Adaptive Pre-training Objective",
  "description": "**Dialogue-Adaptive Pre-training Objective (DAPO)** is a pre-training objective for dialogue adaptation, which is designed to measure qualities of dialogues from multiple important aspects, like Readability, Consistency and Fluency which have already been focused on by general LM pre-training objectives, and those also significant for assessing dialogues but ignored by general LM pre-training objectives, like Diversity and Specificity.",
  "title": "Dialogue-adaptive Language Model Pre-training From Quality Estimation",
  "collection": "Dialog Adaptation",
  "area": "Natural Language Processing"
}
{
  "name": "BRepNet",
  "full_name": "BRepNet",
  "description": "**BRepNet** is a neural network for CAD applications. It is designed to operate directly on B-rep data structures, avoiding the need to approximate the model as meshes or point clouds. BRepNet defines convolutional kernels with respect to oriented coedges in the data structure. In the neighborhood of each coedge, a small collection of faces, edges and coedges can be identified and patterns in the feature vectors from these entities detected by specific learnable parameters.",
  "title": "BRepNet: A topological message passing system for solid models",
  "collection": "CAD Design Models",
  "area": "Computer Vision"
}
{
  "name": "BLOOM",
  "full_name": "BLOOM",
  "description": "**BLOOM** is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of\r\nsources in 46 natural and 13 programming languages (59 in total).",
  "title": "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "CP-N3-RP",
  "full_name": "CP with N3 Regularizer and Relation Prediction",
  "description": "CP with N3 Regularizer and Relation Prediction",
  "title": "Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Non Maximum Suppression",
  "full_name": "Non Maximum Suppression",
  "description": "**Non Maximum Suppression** is a computer vision method that selects a single entity out of many overlapping entities (for example bounding boxes in object detection). The criteria is usually discarding entities that are below a given probability bound. With remaining entities we repeatedly pick the entity with the highest probability, output that as the prediction, and discard any remaining box where a $\\text{IoU} \\geq 0.5$ with the box output in the previous step.\r\n\r\nImage Credit: [Martin Kersner](https://github.com/martinkersner/non-maximum-suppression-cpp)",
  "title": null,
  "collection": "Proposal Filtering",
  "area": "Computer Vision"
}
{
  "name": "AMSBound",
  "full_name": "AMSBound",
  "description": "**AMSBound** is a variant of the [AMSGrad](https://paperswithcode.com/method/amsgrad) stochastic optimizer which is designed to be more robust to extreme learning rates. Dynamic bounds are employed on learning rates, where the lower and upper bound are initialized as zero and infinity respectively, and they both smoothly converge to a constant final step size. AMSBound can be regarded as an adaptive method at the beginning of training, and it gradually and smoothly transforms to [SGD](https://paperswithcode.com/method/sgd) (or with momentum) as time step increases. \r\n\r\n$$ g\\_{t} = \\nabla{f}\\_{t}\\left(x\\_{t}\\right) $$\r\n\r\n$$ m\\_{t} = \\beta\\_{1t}m\\_{t-1} + \\left(1-\\beta\\_{1t}\\right)g\\_{t} $$\r\n\r\n$$ v\\_{t} = \\beta\\_{2}v\\_{t-1} + \\left(1-\\beta\\_{2}\\right)g\\_{t}^{2}$$\r\n\r\n$$ \\hat{v}\\_{t} = \\max\\left(\\hat{v}\\_{t-1}, v\\_{t}\\right) \\text{ and } V\\_{t} = \\text{diag}\\left(\\hat{v}\\_{t}\\right) $$\r\n\r\n$$ \\eta = \\text{Clip}\\left(\\alpha/\\sqrt{V\\_{t}}, \\eta\\_{l}\\left(t\\right), \\eta\\_{u}\\left(t\\right)\\right) \\text{ and } \\eta\\_{t} = \\eta/\\sqrt{t} $$\r\n\r\n$$ x\\_{t+1} = \\Pi\\_{\\mathcal{F}, \\text{diag}\\left(\\eta\\_{t}^{-1}\\right)}\\left(x\\_{t} - \\eta\\_{t} \\odot m\\_{t} \\right) $$\r\n\r\nWhere $\\alpha$ is the initial step size, and $\\eta_{l}$ and $\\eta_{u}$ are the lower and upper bound functions respectively.",
  "title": "Adaptive Gradient Methods with Dynamic Bound of Learning Rate",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "MoGA-B",
  "full_name": "MoGA-B",
  "description": "**MoGA-B** is a convolutional neural network optimized for mobile latency and discovered via Mobile GPU-Aware (MoGA) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). The basic building block is MBConvs (inverted residual blocks) from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2). Squeeze-and-excitation layers are also experimented with.",
  "title": "MoGA: Searching Beyond MobileNetV3",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "LOGAN",
  "full_name": "LOGAN",
  "description": "**LOGAN** is a generative adversarial network that uses a latent optimization approach using [natural gradient descent](https://paperswithcode.com/method/natural-gradient-descent) (NGD). For the Fisher matrix in NGD, the authors use the empirical Fisher $F'$ with Tikhonov damping:\r\n\r\n$$ F' = g \\cdot g^{T} + \\beta{I} $$\r\n\r\nThey also use Euclidian Norm regularization for the optimization step.\r\n\r\nFor LOGAN's base architecture, [BigGAN-deep](https://paperswithcode.com/method/biggan-deep) is used with a few modifications: increasing the size of the latent source from $186$ to $256$, to compensate the randomness of the source lost\r\nwhen optimising $z$. 2, using the uniform distribution $U\\left(−1, 1\\right)$ instead of the standard normal distribution $N\\left(0, 1\\right)$ for $p\\left(z\\right)$ to be consistent with the clipping operation, using  leaky [ReLU](https://paperswithcode.com/method/relu) (with the slope of 0.2 for the negative part) instead of ReLU as the non-linearity for smoother gradient flow for $\\frac{\\delta{f}\\left(z\\right)}{\\delta{z}}$ .",
  "title": "LOGAN: Latent Optimisation for Generative Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Bi-attention",
  "full_name": "Bilinear Attention",
  "description": "Bi-attention employs the attention-in-attention (AiA) mechanism to capture second-order statistical information: the outer point-wise channel attention vectors are computed from the output of the inner channel attention.",
  "title": "Bilinear Attention Networks for Person Retrieval",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Differentiable Hyperparameter Search",
  "full_name": "Differentiable Hyperparameter Search",
  "description": "Differentiable simultaneous optimization of hyperparameters and neural network architecture. Also a [Neural Architecture Search](https://paperswithcode.com/method/neural-architecture-search) (NAS) method.",
  "title": "sharpDARTS: Faster and More Accurate Differentiable Architecture Search",
  "collection": "Hyperparameter Search",
  "area": "General"
}
{
  "name": "TD-Gammon",
  "full_name": "TD-Gammon",
  "description": "**TD-Gammon** is a game-learning architecture for playing backgammon. It involves the use of a $TD\\left(\\lambda\\right)$ learning algorithm and a feedforward neural network.\r\n\r\nCredit: [Temporal Difference Learning and\r\nTD-Gammon](https://cling.csd.uwo.ca/cs346a/extra/tdgammon.pdf)",
  "title": null,
  "collection": "Board Game Models",
  "area": "Reinforcement Learning"
}
{
  "name": "RAM",
  "full_name": "Recurrent models of visual attention",
  "description": "RAM adopts RNNs and reinforcement learning (RL) to make the network learn where to pay attention.",
  "title": "Recurrent Models of Visual Attention",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "CoVR",
  "full_name": "Composed Video Retrieval",
  "description": "The composed video retrieval (CoVR) task is a new task, where the goal is to find a video that matches both a query image and a query text. The query image represents a visual concept that the user is interested in, and the query text specifies how the concept should be modified or refined. For example, given an image of a fountain and the text _during show at night_, the CoVR task is to retrieve a video that shows the fountain at night with a show.",
  "title": "CoVR: Learning Composed Video Retrieval from Web Video Captions",
  "collection": "Video-Text Retrieval Models",
  "area": "Computer Vision"
}
{
  "name": "Rendezvous",
  "full_name": "Multi-head of Mixed Attention",
  "description": "Multi-heads of both self and cross attentions",
  "title": "Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Transposed convolution",
  "full_name": "Transposed convolution",
  "description": "",
  "title": "Fully Convolutional Networks for Semantic Segmentation",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "ED-GNN",
  "full_name": "Medical Entity Disambiguation using Graph Neural Networks",
  "description": "",
  "title": "Medical Entity Disambiguation Using Graph Neural Networks",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "ProphetNet",
  "full_name": "ProphetNet",
  "description": "**ProphetNet** is a sequence-to-sequence pre-training model that introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of optimizing one-step-ahead prediction in the traditional sequence-to-sequence model, the ProphetNet is optimized by $n$-step ahead prediction that predicts the next $n$ tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and further help predict multiple future tokens.",
  "title": "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "SpecGAN",
  "full_name": "SpecGAN",
  "description": "**SpecGAN** is a generative adversarial network method for spectrogram-based, frequency-domain audio generation. The problem is suited for GANs designed for image generation. The model can be approximately inverted. \r\n\r\nTo process audio into suitable spectrograms, the authors perform the short-time Fourier transform with 16 ms windows and 8ms stride, resulting in 128 frequency bins, linearly spaced from 0 to 8 kHz. They take the magnitude of the resultant spectra and scale amplitude values logarithmically to better-align with human perception. They then normalize each frequency bin to have zero mean and unit variance. They clip the spectra to $3$ standard deviations and rescale to $\\left[−1, 1\\right]$.\r\n\r\nThey then use the [DCGAN](https://paperswithcode.com/method/dcgan) approach on the result spectra.",
  "title": "Adversarial Audio Synthesis",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "Differential Diffusion",
  "full_name": "Differential Diffusion",
  "description": "**Differential Diffusion** is an enhancement of image-to-image diffusion models that adds the ability to control the amount of change applied to each image fragment via a change map.",
  "title": "Differential Diffusion: Giving Each Pixel Its Strength",
  "collection": "Image Generation Models",
  "area": "Computer Vision"
}
{
  "name": "Contractive Autoencoder",
  "full_name": "Contractive Autoencoder",
  "description": "A **Contractive Autoencoder** is an autoencoder that adds a penalty term to the classical reconstruction cost function. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. This penalty term results in a localized space contraction which in turn yields robust features on the activation layer. The penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold.",
  "title": null,
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "ShuffleNet v2",
  "full_name": "ShuffleNet v2",
  "description": "**ShuffleNet v2** is a convolutional neural network optimized for a direct metric (speed) rather than indirect metrics like FLOPs. It builds upon [ShuffleNet v1](https://paperswithcode.com/method/shufflenet), which utilised pointwise group convolutions, bottleneck-like structures, and a [channel shuffle](https://paperswithcode.com/method/channel-shuffle) operation. Differences are shown in the Figure to the right, including a new channel split operation and moving the channel shuffle operation further down the block.",
  "title": "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "PNAS",
  "full_name": "Progressive Neural Architecture Search",
  "description": "**Progressive Neural Architecture Search**, or **PNAS**, is a method for learning the structure of convolutional neural networks (CNNs). It uses a sequential model-based optimization (SMBO) strategy, where we search the space of cell structures, starting with simple (shallow) models and progressing to complex ones, pruning out unpromising structures as we go. \r\n\r\nAt iteration $b$ of the algorithm, we have a set of $K$ candidate cells (each of size $b$ blocks), which we train and evaluate on a dataset of interest. Since this process is expensive, PNAS also learns a model or surrogate function which can predict the performance of a structure without needing to train it. We then expand the $K$ candidates of size $b$ into $K' \\gg K$ children, each of size $b+1$. The surrogate function is used to rank all of the $K'$ children, pick the top $K$, and then train and evaluate them. We continue in this way until $b=B$, which is the maximum number of blocks we want to use in a cell.",
  "title": "Progressive Neural Architecture Search",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "StreaMRAK",
  "full_name": "StreaMRAK",
  "description": "**StreaMRAK** is a streaming version of kernel ridge regression. It divdes the problem into several levels of resolution, which allows continual refinement to the predictions.",
  "title": "StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm",
  "collection": "Kernel Methods",
  "area": "General"
}
{
  "name": "CenterTrack",
  "full_name": "Track objects as points",
  "description": "Our tracker, CenterTrack, applies a detection model to a pair of images and detections from the prior frame. Given this minimal input, CenterTrack localizes objects and predicts their associations with the previous frame. That's it. CenterTrack is simple, online (no peeking into the future), and real-time.",
  "title": "Tracking Objects as Points",
  "collection": "Multi-Object Tracking Models",
  "area": "Computer Vision"
}
{
  "name": "Copy-Paste",
  "full_name": "simple Copy-Paste",
  "description": "",
  "title": "Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "StereoLayers",
  "full_name": "StereoLayers",
  "description": "",
  "title": "Stereo Magnification with Multi-Layer Images",
  "collection": "3D Representations",
  "area": "Computer Vision"
}
{
  "name": "Spectral-Normalized Identity Priors",
  "full_name": "Spectral-Normalized Identity Priors",
  "description": "**Spectral-Normalized Identity Priors**, or **SNIP**, is a structured pruning approach that penalizes an entire [residual module](https://paperswithcode.com/method/residual-connection) in a [Transformer model](https://paperswithcode.com/method/residual-connection) toward an identity mapping. It is applicable to any structured module, including a single [attention head](https://paperswithcode.com/method/scaled), an [entire attention block](https://paperswithcode.com/method/multi-head-attention), or a [feed-forward subnetwork](https://paperswithcode.com/method/position-wise-feed-forward-layer). The method identifies and discards unimportant non-linear mappings in the [residual connections](https://paperswithcode.com/method/residual-connection) by applying a thresholding operator on the function norm. Furthermore, [spectral normalization](https://paperswithcode.com/method/spectral-normalization) to stabilize the distribution of the post-activation values of the [Transformer](https://paperswithcode.com/method/transformer) layers, further improving the pruning effectiveness of the proposed methodology.",
  "title": "Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior",
  "collection": "Pruning",
  "area": "General"
}
{
  "name": "Informative Sample Mining Network",
  "full_name": "Informative Sample Mining Network",
  "description": "**Informative Sample Mining Network** is a multi-stage sample training scheme for GANs to reduce sample hardness while preserving sample informativeness. Adversarial Importance Weighting is proposed to select informative samples and assign them greater weight. The authors also propose Multi-hop Sample Training to avoid the potential problems in model training caused by sample mining. Based on the principle of divide-and-conquer, the authors produce target images by multiple hops, which means the image translation is decomposed into several separated steps.",
  "title": "Informative Sample Mining Network for Multi-Domain Image-to-Image Translation",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "NoisyNet-A3C",
  "full_name": "NoisyNet-A3C",
  "description": "**NoisyNet-A3C** is a modification of [A3C](https://paperswithcode.com/method/a3c) that utilises noisy linear layers for exploration instead of \r\n$\\epsilon$-greedy exploration as in the original [DQN](https://paperswithcode.com/method/dqn) formulation.",
  "title": "Noisy Networks for Exploration",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "Soft Actor-Critic (Autotuned Temperature)",
  "full_name": "Soft Actor-Critic (Autotuned Temperature)",
  "description": "**Soft Actor Critic (Autotuned Temperature** is a modification of the [SAC](https://paperswithcode.com/method/soft-actor-critic) reinforcement learning algorithm. [SAC](https://paperswithcode.com/method/sac) can suffer from brittleness to the temperature hyperparameter. Unlike in conventional reinforcement learning, where the optimal policy is independent of scaling of the reward function, in maximum entropy reinforcement learning the scaling factor has to be compensated by the choice a of suitable temperature, and a sub-optimal temperature can drastically degrade performance. To resolve this issue, SAC with Autotuned Temperature has an automatic gradient-based temperature tuning method that adjusts the expected entropy over the visited states to match a target value.",
  "title": "Soft Actor-Critic Algorithms and Applications",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "SLR",
  "full_name": "Surrogate Lagrangian Relaxation",
  "description": "Please enter a description about the method here",
  "title": "Enabling Retrain-free Deep Neural Network Pruning using Surrogate Lagrangian Relaxation",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "HiSD",
  "full_name": "Hierarchical Style Disentanglement",
  "description": "**Hierarchical Style Disentanglement**, or **HiSD**,  aims to disentangle different styles in image-to-image translation models. It organizes the labels into a hierarchical structure, where independent tags, exclusive attributes, and disentangled styles are allocated from top to bottom. To make the styles identified to the tags and attributes, the authors carefully redesign the modules, phases, and objectives.",
  "title": "Image-to-image Translation via Hierarchical Style Disentanglement",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Image Scale Augmentation",
  "full_name": "Image Scale Augmentation",
  "description": "Image Scale Augmentation is an augmentation technique where we randomly pick the short size of a image within a dimension range. One use case of this augmentation technique is in object detectiont asks.",
  "title": null,
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "MuVER",
  "full_name": "MuVER",
  "description": "**Multi-View Entity Representations**, or **MuVER**, is an approach for entity retrieval that constructs multi-view representations for entity descriptions and approximates the optimal view for mentions via a heuristic searching method. It matches a mention to the appropriate entity by comparing it with entity descriptions. Motivated by the fact that mentions with different contexts correspond to different parts in descriptions, multi-view representations are constructed for each description. Specifically, we segment a description into several sentences. We refer to each sentence as a view $v$, which contains partial information, to form a view set $\\mathcal{V}$ of the entity $e$. The Figure illustrates an example that constructs a view set $\\mathcal{V}$ for “Kobe Bryant”.",
  "title": "MuVER: Improving First-Stage Entity Retrieval with Multi-View Entity Representations",
  "collection": "Entity Retrieval Models",
  "area": "Natural Language Processing"
}
{
  "name": "Bridge-net",
  "full_name": "Bridge-net",
  "description": "**Bridge-net** is an audio model block used in the [ClariNet](https://paperswithcode.com/method/clarinet) text-to-speech architecture. Bridge-net maps frame-level hidden representation to sample-level through several [convolution](https://paperswithcode.com/method/convolution) blocks and [transposed convolution](https://paperswithcode.com/method/transposed-convolution) layers interleaved with softsign non-linearities.",
  "title": "ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "Stochastic Gradient Variational Bayes",
  "full_name": "Stochastic Gradient Variational Bayes",
  "description": "",
  "title": "Auto-Encoding Variational Bayes",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "Expected Sarsa",
  "full_name": "Expected Sarsa",
  "description": "**Expected Sarsa** is like [Q-learning](https://paperswithcode.com/method/q-learning) but instead of taking the maximum over next state-action pairs, we use the expected value, taking into account how likely each action is under the current policy.\r\n\r\n$$Q\\left(S\\_{t}, A\\_{t}\\right) \\leftarrow Q\\left(S\\_{t}, A\\_{t}\\right) + \\alpha\\left[R_{t+1} + \\gamma\\sum\\_{a}\\pi\\left(a\\mid{S\\_{t+1}}\\right)Q\\left(S\\_{t+1}, a\\right) - Q\\left(S\\_{t}, A\\_{t}\\right)\\right] $$\r\n\r\nExcept for this change to the update rule, the algorithm otherwise follows the scheme of Q-learning. It is more computationally expensive than [Sarsa](https://paperswithcode.com/method/sarsa) but it eliminates the variance due to the random selection of $A\\_{t+1}$.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition",
  "title": null,
  "collection": "On-Policy TD Control",
  "area": "Reinforcement Learning"
}
{
  "name": "AdamW",
  "full_name": "AdamW",
  "description": "**AdamW** is a stochastic optimization method that modifies the typical implementation of weight decay in [Adam](https://paperswithcode.com/method/adam), by decoupling [weight decay](https://paperswithcode.com/method/weight-decay) from the gradient update. To see this, $L\\_{2}$ regularization in Adam is usually implemented with the below modification where $w\\_{t}$ is the rate of the weight decay at time $t$:\r\n\r\n$$ g\\_{t} = \\nabla{f\\left(\\theta\\_{t}\\right)} + w\\_{t}\\theta\\_{t}$$\r\n\r\nwhile AdamW adjusts the weight decay term to appear in the gradient update:\r\n\r\n$$ \\theta\\_{t+1, i} = \\theta\\_{t, i} - \\eta\\left(\\frac{1}{\\sqrt{\\hat{v}\\_{t} + \\epsilon}}\\cdot{\\hat{m}\\_{t}} + w\\_{t, i}\\theta\\_{t, i}\\right), \\forall{t}$$",
  "title": "Decoupled Weight Decay Regularization",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Graph Transformer",
  "full_name": "Graph Transformer",
  "description": "This is **Graph Transformer** method, proposed as a generalization of [Transformer](https://paperswithcode.com/method/transformer) Neural Network architectures, for arbitrary graphs.\r\n\r\nCompared to the original Transformer, the highlights of the presented architecture are:\r\n\r\n- The attention mechanism is a function of neighborhood connectivity for each node in the graph.  \r\n- The position encoding is represented by Laplacian eigenvectors, which naturally generalize the sinusoidal positional encodings often used in NLP.  \r\n- The [layer normalization](https://paperswithcode.com/method/layer-normalization) is replaced by a [batch normalization](https://paperswithcode.com/method/batch-normalization) layer.  \r\n- The architecture is extended to have edge representation, which can be critical to tasks with rich information on the edges, or pairwise interactions (such as bond types in molecules, or relationship type in KGs. etc).",
  "title": "A Generalization of Transformer Networks to Graphs",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "YOLOv2",
  "full_name": "YOLOv2",
  "description": "**YOLOv2**, or [**YOLO9000**](https://www.youtube.com/watch?v=QsDDXSmGJZA), is a single-stage real-time object detection model. It improves upon [YOLOv1](https://paperswithcode.com/method/yolov1) in several ways, including the use of [Darknet-19](https://paperswithcode.com/method/darknet-19) as a backbone, [batch normalization](https://paperswithcode.com/method/batch-normalization), use of a high-resolution classifier, and the use of anchor boxes to predict bounding boxes, and more.",
  "title": "YOLO9000: Better, Faster, Stronger",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "PVT",
  "full_name": "Pyramid Vision Transformer",
  "description": "**PVT**, or **Pyramid Vision Transformer**, is a type of [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) that utilizes a pyramid structure to make it an effective backbone for dense prediction tasks. Specifically it allows for more fine-grained inputs (4 x 4 pixels per patch) to be used, while simultaneously shrinking the sequence length of the Transformer as it deepens - reducing the computational cost. Additionally, a [spatial-reduction attention](https://paperswithcode.com/method/spatial-reduction-attention) (SRA) layer is used to further reduce the resource consumption when learning high-resolution features.\r\n\r\nThe entire model is divided into four stages, each of which is comprised of a patch embedding layer and a $\\mathcal{L}\\_{i}$-layer Transformer encoder. Following a pyramid structure, the output resolution of the four stages progressively shrinks from high (4-stride) to low (32-stride).",
  "title": "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "CheXNet",
  "full_name": "CheXNet",
  "description": "**CheXNet** is a 121-layer [DenseNet](https://paperswithcode.com/method/densenet) trained on ChestX-ray14 for pneumonia detection.",
  "title": "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Estimation Statistics",
  "full_name": "Estimation Statistics",
  "description": "Estimation statistics is a data analysis framework that uses a combination of effect sizes, confidence intervals, precision planning, and meta-analysis to plan experiments, analyze data and interpret results. It is distinct from null hypothesis significance testing (NHST), which is considered to be less informative. The primary aim of estimation methods is to report an effect size (a point estimate) along with its confidence interval, the latter of which is related to the precision of the estimate. The confidence interval summarizes a range of likely values of the underlying population effect. Proponents of estimation see reporting a P value as an unhelpful distraction from the important business of reporting an effect size with its confidence intervals, and believe that estimation should replace significance testing for data analysis.",
  "title": null,
  "collection": "Statistical Inference",
  "area": "General"
}
{
  "name": "Sharpness-Aware Minimization",
  "full_name": "Sharpness-Aware Minimization",
  "description": "**Sharpness-Aware Minimization**, or **SAM**, is a procedure that improves model generalization by simultaneously minimizing loss value and loss sharpness. SAM functions by seeking parameters that lie in neighborhoods having uniformly low loss value (rather than parameters that only themselves have low loss value).",
  "title": "Sharpness-Aware Minimization for Efficiently Improving Generalization",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "Fast-YOLOv4-SmallObj",
  "full_name": "Fast-YOLOv4-SmallObj",
  "description": "The Fast-YOLOv4-SmallObj model is a modified version of Fast-[YOLOv4](https://paperswithcode.com/method/yolov4) to improve the detection of small objects. Seven layers were added so that it predicts bounding boxes at 3 different scales instead of 2.",
  "title": "Towards Image-based Automatic Meter Reading in Unconstrained Scenarios: A Robust and Efficient Approach",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "SqueezeNeXt Block",
  "full_name": "SqueezeNeXt Block",
  "description": "A **SqueezeNeXt Block** is a two-stage bottleneck module used in the [SqueezeNeXt](https://paperswithcode.com/method/squeezenext) architecture to reduce the number of input channels to the 3 × 3 [convolution](https://paperswithcode.com/method/convolution). We decompose with separable convolutions to further reduce the number of parameters (orange parts), followed by a 1 × 1 expansion module.",
  "title": "SqueezeNext: Hardware-Aware Neural Network Design",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "Position-Sensitive RoIAlign",
  "full_name": "Position-Sensitive RoIAlign",
  "description": "**Position-Sensitive RoIAlign** is a positive sensitive version of [RoIAlign](https://paperswithcode.com/method/roi-align) - i.e. it performs selective alignment, allowing for the learning of position-sensitive region of interest aligning.",
  "title": null,
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "SPL",
  "full_name": "Semi-Pseudo-Label",
  "description": "",
  "title": "A Novel Neural Network Training Method for Autonomous Driving Using Semi-Pseudo-Labels and 3D Data Augmentations",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "Gumbel Activation",
  "full_name": "Gumbel Cross Entropy",
  "description": "Gumbel activation function, is defined using the cumulative Gumbel distribution and it can be used to perform Gumbel regression. Gumbel activation is an alternative activation function to the sigmoid or softmax activation functions and can be used to transform the unormalised output of a model to probability. Gumbel activation  $\\eta_{Gumbel}$ is defined as follows:\r\n\r\n$\\eta_{Gumbel}(q_i) = exp(-exp(-q_i))$\r\n\r\nIt can be combined with Cross Entropy loss function to solve long-tailed classification problems. Gumbel Cross Entropy (GCE) is defined as follows:\r\n\r\n$GCE(\\eta_{Gumbel}(q_i),y_i) = -y_i \\log(\\eta_{Gumbel}(q_i))+ (1-y_i) \\log(1-\\eta_{Gumbel}(q_i))$",
  "title": "Long-tailed Instance Segmentation using Gumbel Optimized Loss",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "QHAdam",
  "full_name": "QHAdam",
  "description": "The **Quasi-Hyperbolic Momentum Algorithm (QHM)** is a simple alteration of [momentum SGD](https://paperswithcode.com/method/sgd-with-momentum), averaging a plain [SGD](https://paperswithcode.com/method/sgd) step with a momentum step. **QHAdam** is a QH augmented version of [Adam](https://paperswithcode.com/method/adam), where we replace both of Adam's moment estimators with quasi-hyperbolic terms. QHAdam decouples the momentum term from the current gradient when updating the weights, and decouples the mean squared gradients term from the current squared gradient when updating the weights. \r\n\r\nIn essence, it is a weighted average of the momentum and plain SGD, weighting the current gradient with an immediate discount factor $v\\_{1}$ divided by a weighted average of the mean squared gradients and the current squared gradient, weighting the current squared gradient with an immediate discount factor $v\\_{2}$. \r\n\r\n$$ \\theta\\_{t+1, i} = \\theta\\_{t, i} - \\eta\\left[\\frac{\\left(1-v\\_{1}\\right)\\cdot{g\\_{t}} + v\\_{1}\\cdot\\hat{m}\\_{t}}{\\sqrt{\\left(1-v\\_{2}\\right)g^{2}\\_{t} + v\\_{2}\\cdot{\\hat{v}\\_{t}}} + \\epsilon}\\right], \\forall{t} $$\r\n\r\nIt is recommended to set $v\\_{2} = 1$ and $\\beta\\_{2}$ same as in Adam.",
  "title": "Quasi-hyperbolic momentum and Adam for deep learning",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Perceiver IO",
  "full_name": "Perceiver IO",
  "description": "Perceiver IO is a general neural network architecture that performs well for structured input modalities and output tasks. Perceiver IO is built to easily integrate and transform arbitrary information for arbitrary tasks.",
  "title": "Perceiver IO: A General Architecture for Structured Inputs & Outputs",
  "collection": null,
  "area": null
}
{
  "name": "Bilateral Grid",
  "full_name": "Bilateral Grid",
  "description": "Bilateral grid is a new data structure that enables fast edge-aware image processing. It enables edge-aware image manipulations such as local tone mapping on high resolution images in real time.\r\n\r\nSource: [Chen et al.](https://people.csail.mit.edu/sparis/publi/2007/siggraph/Chen_07_Bilateral_Grid.pdf)\r\n\r\nImage source: [Chen et al.](https://people.csail.mit.edu/sparis/publi/2007/siggraph/Chen_07_Bilateral_Grid.pdf)",
  "title": "Deep Bilateral Learning for Real-Time Image Enhancement",
  "collection": "Image Representations",
  "area": "Computer Vision"
}
{
  "name": "WEGL",
  "full_name": "Wasserstein Embedding for Graph Learning",
  "description": "Please enter a description here",
  "title": "Wasserstein Embedding for Graph Learning",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "MeshGraphNet",
  "full_name": "MeshGraphNet",
  "description": "**MeshGraphNet** is a framework for learning mesh-based simulations using [graph neural networks](https://paperswithcode.com/methods/category/graph-models). The model can be trained to pass messages on a mesh graph and to adapt the mesh discretization during forward simulation. The model uses an Encode-Process-Decode architecture trained with one-step supervision, and can be applied iteratively to generate long trajectories at inference time. The encoder transforms the input mesh $M^{t}$ into a graph, adding extra world-space edges. The processor performs several rounds of message passing along mesh edges and world edges, updating all node and edge embeddings. The decoder extracts the acceleration for each node, which is used to update the mesh to produce $M^{t+1}$.",
  "title": "Learning Mesh-Based Simulation with Graph Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "K3M",
  "full_name": "K3M",
  "description": "**K3M** is a multi-modal pretraining method for e-commerce product data that introduces knowledge modality to correct the noise and supplement the missing of image and text modalities. The modal-encoding layer extracts the features of each modality. The modal-interaction layer is capable of effectively modeling the interaction of multiple modalities, where an initial-interactive feature fusion model is designed to maintain the independence of image modality and text modality, and a structure aggregation module is designed to fuse the information of image, text, and knowledge modalities. K3M is pre-trained with three pretraining tasks, including masked object modeling (MOM), masked language modeling (MLM), and link prediction modeling ([LPM](https://paperswithcode.com/method/local-prior-matching)).",
  "title": "Knowledge Perceived Multi-modal Pretraining in E-commerce",
  "collection": "Language Model Pre-Training",
  "area": "Natural Language Processing"
}
{
  "name": "PatchGAN",
  "full_name": "PatchGAN",
  "description": "**PatchGAN** is a type of discriminator for generative adversarial networks which only penalizes structure at the scale of local image patches. The PatchGAN discriminator tries to classify if each $N \\times N$ patch in an image is real or fake. This discriminator is run convolutionally across the image, averaging all responses to provide the ultimate output of $D$. Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. It can be understood as a type of texture/style loss.",
  "title": "Image-to-Image Translation with Conditional Adversarial Networks",
  "collection": "Discriminators",
  "area": "General"
}
{
  "name": "ReInfoSelect",
  "full_name": "ReInfoSelect",
  "description": "**ReInfoSelect** is a reinforcement weak supervision selection method for information retrieval. It learns to select anchor-document pairs that best weakly supervise the neural ranker (action), using the ranking performance on a handful of relevance labels as the reward. Iteratively, for a batch of anchor-document pairs, ReInfoSelect back propagates the gradients through the neural ranker, gathers its NDCG reward, and optimizes the data selection network using policy gradients, until the neural ranker's performance peaks on target relevance metrics (convergence).",
  "title": "Selective Weak Supervision for Neural Information Retrieval",
  "collection": "Information Bottleneck",
  "area": "General"
}
{
  "name": "Fire Module",
  "full_name": "Fire Module",
  "description": "A **Fire Module** is a building block for convolutional neural networks, notably used as part of [SqueezeNet](https://paperswithcode.com/method/squeezenet). A Fire module is comprised of: a squeeze [convolution](https://paperswithcode.com/method/convolution) layer (which has only 1x1 filters), feeding into an expand layer that has a mix of 1x1 and 3x3 convolution filters.  We expose three tunable dimensions (hyperparameters) in a Fire module: $s\\_{1x1}$, $e\\_{1x1}$, and $e\\_{3x3}$. In a Fire module, $s\\_{1x1}$ is the number of filters in the squeeze layer (all 1x1), $e\\_{1x1}$ is the number of 1x1 filters in the expand layer, and $e\\_{3x3}$ is the number of 3x3 filters in the expand layer. When we use Fire modules we set $s\\_{1x1}$ to be less than ($e\\_{1x1}$ + $e\\_{3x3}$), so the squeeze layer helps to limit the number of input channels to the 3x3 filters.",
  "title": "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "WGAN GP",
  "full_name": "Wasserstein GAN (Gradient Penalty)",
  "description": "**Wasserstein GAN + Gradient Penalty**, or **WGAN-GP**, is a generative adversarial network that uses the Wasserstein loss formulation plus a gradient norm penalty to achieve Lipschitz continuity.\r\n\r\nThe original [WGAN](https://paperswithcode.com/method/wgan) uses weight clipping to achieve 1-Lipschitz functions, but this can lead to undesirable behaviour by creating pathological value surfaces and capacity underuse, as well as gradient explosion/vanishing without careful tuning of the weight clipping parameter $c$.\r\n\r\nA Gradient Penalty is a soft version of the Lipschitz constraint, which follows from the fact that functions are 1-Lipschitz iff the gradients are of norm at most 1 everywhere. The squared difference from norm 1 is used as the gradient penalty.",
  "title": "Improved Training of Wasserstein GANs",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "Rank-based Loss",
  "full_name": "Rank-based loss",
  "description": "",
  "title": "Rank-based loss for learning hierarchical representations",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "MutualGuide",
  "full_name": "Mutual Guidance",
  "description": "",
  "title": "Localize to Classify and Classify to Localize: Mutual Guidance in Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "SCARLET-NAS",
  "full_name": "SCARLET-NAS",
  "description": "**SCARLET-NAS** is a type of [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) that utilises a learnable stabilizer to calibrate feature deviation, named the Equivariant Learnable Stabilizer (ELS). Previous one-shot approaches can be limited by fixed-depth search spaces. With SCARLET-NAS, we use the equivariant learnable stabilizer on each skip connection. This can lead to improved convergence, more reliable evaluation, and retained equivalence. The third benefit is deemed most important by the authors for scalability.",
  "title": "SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "CRF",
  "full_name": "Conditional Random Field",
  "description": "**Conditional Random Fields** or **CRFs** are a type of probabilistic graph model that take neighboring sample context into account for tasks like classification. Prediction is modeled as a graphical model, which implements dependencies between the predictions. Graph choice depends on the application, for example linear chain CRFs are popular in natural language processing, whereas in image-based tasks, the graph would connect to neighboring locations in an image to enforce that they have similar predictions.\r\n\r\nImage Credit: [Charles Sutton and Andrew McCallum, An Introduction to Conditional Random Fields](https://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf)",
  "title": null,
  "collection": "Structured Prediction",
  "area": "General"
}
{
  "name": "FeatureNMS",
  "full_name": "FeatureNMS",
  "description": "**Feature Non-Maximum Suppression**, or **FeatureNMS**, is a post-processing step for object detection models that removes duplicates where there are multiple detections outputted per object. FeatureNMS recognizes duplicates not only based on the intersection over union between the bounding boxes, but also based on the difference of feature vectors. These feature vectors can encode more information like visual appearance.",
  "title": "FeatureNMS: Non-Maximum Suppression by Learning Feature Embeddings",
  "collection": "Proposal Filtering",
  "area": "Computer Vision"
}
{
  "name": "Inception-ResNet-v2",
  "full_name": "Inception-ResNet-v2",
  "description": "**Inception-ResNet-v2** is a convolutional neural architecture that builds on the Inception family of architectures but incorporates [residual connections](https://paperswithcode.com/method/residual-connection) (replacing the filter concatenation stage of the Inception architecture).",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Kaleido-BERT",
  "full_name": "Kaleido-BERT",
  "description": "**Kaleido-BERT**(CVPR2021) is the pioneering work that focus on solving PTM in e-commerce field. It achieves SOTA performances compared with many models published in general domain.",
  "title": null,
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "RESCAL-RP",
  "full_name": "RESCAL with Relation Prediction",
  "description": "RESCAL model trained with a relation prediction objective on top of the 1vsAll loss",
  "title": "Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Deep Belief Network",
  "full_name": "Deep Belief Network",
  "description": "A **Deep Belief Network (DBN)** is a multi-layer generative graphical model. DBNs have bi-directional connections ([RBM](https://paperswithcode.com/method/restricted-boltzmann-machine)-type connections) on the top layer while the bottom layers only have top-down connections. They are trained using layerwise pre-training. Pre-training occurs by training the network component by component bottom up: treating the first two layers as an RBM and training, then treating the second layer and third layer as another RBM and training for those parameters.\r\n\r\nSource: [Origins of Deep Learning](https://arxiv.org/pdf/1702.07800.pdf)\r\n\r\nImage Source: [Wikipedia](https://en.wikipedia.org/wiki/Deep_belief_network)",
  "title": null,
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Virtual Data Augmentation",
  "full_name": "Virtual Data Augmentation",
  "description": "**Virtual Data Augmentation**, or **VDA**, is a framework for robustly fine-tuning pre-trained language model. Based on the original token embeddings, a multinomial mixture for augmenting virtual data is constructed, where a masked language model guarantees the semantic relevance and the Gaussian noise provides the augmentation diversity. Furthermore, a regularized training strategy is proposed to balance the two aspects.",
  "title": "Virtual Data Augmentation: A Robust and General Framework for Fine-tuning Pre-trained Models",
  "collection": "Fine-Tuning",
  "area": "General"
}
{
  "name": "CSGLD",
  "full_name": "Contour Stochastic Gradient Langevin Dynamics",
  "description": "Simulations of multi-modal distributions can be very costly and often lead to unreliable predictions. To accelerate the computations, we propose to sample from a flattened distribution to accelerate the computations and estimate the importance weights between the original distribution and the flattened distribution to ensure the correctness of the distribution.",
  "title": "A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions",
  "collection": "Markov Chain Monte Carlo",
  "area": "General"
}
{
  "name": "Multiplicative Attention",
  "full_name": "Multiplicative Attention",
  "description": "**Multiplicative Attention** is an attention mechanism where the alignment score function is calculated as:\r\n\r\n$$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = \\mathbf{h}\\_{i}^{T}\\textbf{W}\\_{a}\\mathbf{s}\\_{j}$$\r\n\r\nHere $\\mathbf{h}$ refers to the hidden states for the encoder/source, and $\\mathbf{s}$ is the hidden states for the decoder/target. The function above is thus a type of alignment score function. We can use a matrix of alignment scores to show the correlation between source and target words, as the Figure to the right shows. Within a neural network, once we have the alignment scores, we calculate the final scores using a [softmax](https://paperswithcode.com/method/softmax) function of these alignment scores (ensuring it sums to 1).\r\n\r\nAdditive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice as it can be implemented more efficiently using matrix multiplication. Both variants perform similar for small dimensionality $d_{h}$ of the decoder states, but [additive attention](https://paperswithcode.com/method/additive-attention) performs better for larger dimensions. One way to mitigate this is to scale $f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right)$ by $1/\\sqrt{d\\_{h}}$ as with [scaled dot-product attention](https://paperswithcode.com/method/scaled).",
  "title": "Effective Approaches to Attention-based Neural Machine Translation",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "NODE",
  "full_name": "Neural Oblivious Decision Ensembles",
  "description": "**Neural Oblivious Decision Ensembles (NODE)** is a tabular data architecture that consists of differentiable oblivious decision trees (ODT) that are trained end-to-end by backpropagation. \r\n\r\nThe core building block is a Neural Oblivious Decision Ensemble (NODE) layer. The layer is composed of $m$ differentiable oblivious decision trees (ODTs) of equal depth $d$. As an input, all $m$ trees get a common vector $x \\in \\mathbb{R}^{n}$, containing $n$ numeric features. Below we describe a design of a single differentiable ODT.\r\n\r\nIn its essence, an ODT is a decision table that splits the data along $d$ splitting features and compares each feature to a learned threshold. Then, the tree returns one of the $2^{d}$ possible responses, corresponding to the comparison results. Therefore, each ODT is completely determined by its splitting features $f \\in \\mathbb{R}^{d}$, splitting thresholds $b \\in \\mathbb{R}^{d}$ and a $d$-dimensional tensor of responses $R \\in \\mathbb{R}^{\\underbrace{2 \\times 2 \\times 2}_{d}}$. In this notation, the tree output is defined as:\r\n\r\n$$\r\nh(x)=R\\left[\\mathbb{1}\\left(f\\_{1}(x)-b\\_{1}\\right), \\ldots, \\mathbb{1}\\left(f\\_{d}(x)-b\\_{d}\\right)\\right]\r\n$$\r\nwhere $\\mathbb{1}(\\cdot)$ denotes the Heaviside function.",
  "title": "Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "Deformable RoI Pooling",
  "full_name": "Deformable RoI Pooling",
  "description": "**Deformable RoI Pooling** adds an offset to each bin position in the regular bin partition of RoI Pooling. As in deformable convolution, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.",
  "title": "Deformable Convolutional Networks",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "WenLan",
  "full_name": "WenLan",
  "description": "Proposes a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. A cross-modal pre-training model is defined based on the image-text retrieval task. The main goal is thus to learn two encoders that can embed image and text samples into the same space for effective image-text retrieval. To enforce such cross-modal embedding learning, we introduce contrastive learning with the InfoNCE loss into the BriVL model. Given a text embedding, the learning objective aims to find the best image embedding from a batch of image embeddings. Similarly, for a given image embedding, the learning objective is to find the best text embedding from a batch of text embeddings. The pre-training model learns a cross-modal embedding space by jointly training the image and text encoders to maximize the cosine similarity of the image and text embeddings of the true pair for each sample in the batch while minimizing the cosine similarity of the embeddings of the other incorrect pairs.",
  "title": "WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Matrix NMS",
  "full_name": "Matrix Non-Maximum Suppression",
  "description": "**Matrix NMS**, or **Matrix Non-Maximum Suppression**, performs [non-maximum suppression](https://paperswithcode.com/method/non-maximum-suppression) with parallel matrix operations in one shot. It is motivated by [Soft-NMS](https://paperswithcode.com/method/soft-nms). Soft-NMS decays the other detection scores as a monotonically decreasing function $f(\\text{iou})$ of their overlaps. By decaying the scores according to IoUs recursively, higher IoU detections will be eliminated with a minimum score threshold. However, this process is sequential like traditional Greedy NMS and cannot be implemented in parallel.\r\n\r\nMatrix NMS views this process from another perspective by considering how a predicted mask $m\\_{j}$ is suppressed. For $m\\_{j}$, its decay factor is affected by: (a) the penalty of each prediction $m\\_{i}$ on $m\\_{j}$ $\\left(s\\_{i}>s\\_{j}\\right)$, where $s\\_{i}$ and $s\\_{j}$ are the confidence scores; and (b) the probability of $m\\_{i}$ being suppressed. For (a), the penalty of each prediction $m\\_{i}$ on $m\\_{j}$ can be easily computed by $f\\left(\\text{iou}\\_{i, j}\\right)$. For (b), the probability of $m\\_{i}$ being suppressed is not as straightforward to compute. However, the probability usually has a positive correlation with the IoUs. So here we directly approximate the probability by the most overlapped prediction on $m\\_{i}$ as\r\n\r\n$$\r\nf\\left(\\text{iou}\\_{\\cdot, i}\\right)=\\min\\_{\\forall s\\_{k}>s\\_{i}} f\\left(\\text{iou}\\_{k, i}\\right)\r\n$$\r\n\r\nTo this end, the final decay factor becomes\r\n\r\n$$\r\n\\operatorname{decay}\\_{j}=\\min\\_{\\forall s\\_{i}>s\\_{j}} \\frac{f\\left(\\text{iou}\\_{i, j}\\right)}{f\\left(\\text{iou}\\_{\\cdot, i}\\right)}\r\n$$\r\n\r\nand the updated score is computed by $s\\_{j}=s\\_{j} \\cdot \\operatorname{decay}\\_{j}$. The authors consider the two simplest decreasing functions, denoted as linear $f\\left(\\text{iou}\\_{i, j}\\right)=1-\\text{iou}\\_{i, j}$, and Gaussian $f\\left(\\text{iou}\\_{i, j}\\right)=\\exp \\left(-\\frac{\\text{iou}\\_{i, j}^{2}}{\\sigma}\\right)$.",
  "title": "SOLOv2: Dynamic and Fast Instance Segmentation",
  "collection": "Proposal Filtering",
  "area": "Computer Vision"
}
{
  "name": "CornerNet",
  "full_name": "CornerNet",
  "description": "**CornerNet** is an object detection model that detects an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single [convolution](https://paperswithcode.com/method/convolution) neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. It also utilises [corner pooling](https://paperswithcode.com/method/corner-pooling), a new type of pooling layer that helps the network better localize corners.",
  "title": "CornerNet: Detecting Objects as Paired Keypoints",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Fourier Contour Embedding",
  "full_name": "Fourier Contour Embedding",
  "description": "**Fourier Contour Embedding** is a text instance representation that allows networks to learn diverse text geometry variances. Most existing methods model text instances in the image spatial domain via masks or contour point sequences in the Cartesian or the polar coordinate system. However, the mask representation might lead to expensive post-processing, while the point sequence representation may have limited capability to model texts with highly curved shapes. This motivates modeling text instances in the Fourier domain.",
  "title": "Fourier Contour Embedding for Arbitrary-Shaped Text Detection",
  "collection": "Text Instance Representations",
  "area": "Computer Vision"
}
{
  "name": "IFBlock",
  "full_name": "IFBlock",
  "description": "**IFBlock** is a video model block used in the [IFNet](https://paperswithcode.com/method/ifnet) architecture for video frame interpolation. IFBlocks do not contain expensive operators like cost volume or forward warping and use 3 × 3 convolution and deconvolution as building blocks. Each IFBlock has a feed-forward structure consisting of several convolutional layers and an upsampling operator. Except for the layer that outputs the optical flow residuals and the fusion map, [PReLU](https://paperswithcode.com/method/prelu) activations are used.",
  "title": "RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation",
  "collection": "Video Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "AutoSync",
  "full_name": "AutoSync",
  "description": "**AutoSync** is a pipeline for automatically optimizing synchronization strategies, given model structures and resource specifications, in data-parallel distributed machine learning. By factorizing the synchronization strategy with respect to each trainable building block of a DL model, we can construct a valid and large strategy space spanned by multiple factors. AutoSync efficiently navigates the space and locates the optimal strategy. AutoSync leverages domain knowledge about synchronization systems to reduce the search space, and is equipped with a domain adaptive simulator, which combines principled communication modeling and data-driven ML models, to estimate the runtime of strategy proposals without launching real distributed execution.",
  "title": "AutoSync: Learning to Synchronize for Data-Parallel Distributed Deep Learning",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "IkshanaNet",
  "full_name": "The Ikshana Hypothesis of Human Scene Understanding Mechanism",
  "description": "",
  "title": "The Ikshana Hypothesis of Human Scene Understanding",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DropAttack",
  "full_name": "DropAttack",
  "description": "**DropAttack** is an adversarial training method that adds intentionally worst-case adversarial perturbations to both the input and hidden layers in different dimensions and minimizes the adversarial risks generated by each layer.",
  "title": "DropAttack: A Masked Weight Adversarial Training Method to Improve Generalization of Neural Networks",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "RTMDet",
  "full_name": "RTMDet: An Empirical Study of Designing Real-Time Object Detectors",
  "description": "",
  "title": null,
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "PAU",
  "full_name": "Padé Activation Units",
  "description": "Parametrized learnable activation function, based on the Padé approximant.",
  "title": "Padé Activation Units: End-to-end Learning of Flexible Activation Functions in Deep Networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "SC-GPT",
  "full_name": "SC-GPT",
  "description": "**SC-GPT** is a multi-layer [Transformer](http://paperswithcode.com/method/transformer) neural language model, trained in three steps: (i) Pre-trained on plain text, similar to [GPT-2](http://paperswithcode.com/method/gpt-2); (ii) Continually pre-trained on large corpora of dialog-act labeled utterances to acquire the ability of controllable generation; (iii) Fine-tuned for a target domain using very limited amounts of domain labels. Unlike [GPT-2](http://paperswithcode.com/method/gpt-2), SC-GPT generates semantically controlled responses that are conditioned on the given semantic form, similar to SC-[LSTM](https://paperswithcode.com/method/lstm) but requiring far fewer domain labels to generalize to new domains. It is pre-trained on a large annotated NLG corpus to acquire the controllable generation ability, and fine-tuned with only a few domain-specific labels to adapt to new domains.",
  "title": "Few-shot Natural Language Generation for Task-Oriented Dialog",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Instance-Level Meta Normalization",
  "full_name": "Instance-Level Meta Normalization",
  "description": "**Instance-Level Meta Normalization** is a normalization method that addresses a learning-to-normalize problem. ILM-Norm learns to predict the normalization parameters via both the feature feed-forward and the gradient back-propagation paths. It uses an auto-encoder to predict the weights $\\omega$ and bias $\\beta$ as the rescaling parameters for recovering the distribution of the tensor $x$ of feature maps. Instead of using the entire feature tensor $x$ as the input for the auto-encoder, it uses the mean $\\mu$ and variance $\\gamma$ of $x$ for characterizing its statistics.",
  "title": "Instance-Level Meta Normalization",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Barlow Twins",
  "full_name": "Barlow Twins",
  "description": "**Barlow Twins** is a self-supervised learning method that applies redundancy reduction, a principle first proposed in neuroscience, to self-supervised learning. The objective function measures the cross-correlation matrix between the embeddings of two identical networks fed with distorted versions of a batch of samples, and tries to make this matrix close to the identity. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. Barlow Twins requires neither large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly, it benefits from very high-dimensional output vectors.",
  "title": "Barlow Twins: Self-Supervised Learning via Redundancy Reduction",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "TUM",
  "full_name": "Thinned U-shape Module",
  "description": "**Thinned U-shape Module**, or **TUM**, is a feature extraction block used for object detection models. It was introduced as part of the [M2Det](https://paperswithcode.com/method/m2det) architecture. Different from [FPN](https://paperswithcode.com/method/fpn) and [RetinaNet](https://paperswithcode.com/method/retinanet), TUM adopts a thinner U-shape structure as illustrated in the Figure to the right. The encoder is a series of 3x3 [convolution](https://paperswithcode.com/method/convolution) layers with stride 2. And the decoder takes the outputs of these layers as its reference set of feature maps, while the original FPN chooses the output of the last layer of each stage in [ResNet](https://paperswithcode.com/method/resnet) backbone. \r\n\r\nIn addition, with TUM, we add [1x1 convolution](https://paperswithcode.com/method/1x1-convolution) layers after the upsample and element-wise sum operation at the decoder branch to enhance learning ability and keep smoothness for the features. In the context of M2Det, all of the outputs in the decoder of each TUM form the multi-scale features of the current level. As a whole, the outputs of stacked TUMs form the multi-level multi-scale features, while the front TUM mainly provides shallow-level features, the middle TUM provides medium-level features, and the back TUM provides deep-level features.",
  "title": "M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "LMU",
  "full_name": "Legendre Memory Unit",
  "description": "The Legendre Memory Unit (LMU) is mathematically derived to orthogonalize its continuous-time history, doing so by solving d coupled ordinary differential equations (ODEs), whose phase space linearly maps onto sliding windows of time via the Legendre polynomials up to degree d-1. It is optimal for compressing temporal information.\r\n\r\nSee the paper for equations.\r\n\r\nOfficial GitHub repo: [https://github.com/abr/lmu](https://github.com/abr/lmu)",
  "title": "Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "AutoSmart",
  "full_name": "AutoSmart",
  "description": "**AutoSmart** is an AutoML framework for temporal relational data. The framework includes automatic data processing, table merging, feature engineering, and model tuning, integrated with a time and memory control unit.",
  "title": "AutoSmart: An Efficient and Automatic Machine Learning framework for Temporal Relational Data",
  "collection": "AutoML",
  "area": "General"
}
{
  "name": "Attribute2Font",
  "full_name": "Attribute2Font",
  "description": "**Attribute2Font** is a model that automatically creates fonts by synthesizing visually pleasing glyph images according to user-specified attributes and their corresponding values. Specifically, Attribute2Font is trained to perform font style transfer between any two fonts conditioned on their attribute values. After training, the model can generate glyph images in accordance with an arbitrary set of font attribute values. A unit named Attribute Attention Module is designed to make those generated glyph images better embody the prominent font attributes. A semi-supervised learning scheme is also introduced to exploit a large number of unlabeled fonts.",
  "title": "Attribute2Font: Creating Fonts You Want From Attributes",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Colorization",
  "full_name": "Colorization",
  "description": "**Colorization** is a self-supervision approach that relies on colorization as the pretext task in order to learn image representations.",
  "title": "Colorful Image Colorization",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "mRNN",
  "full_name": "Multiplicative RNN",
  "description": "A **Multiplicative RNN (mRNN)** is a type of recurrent neural network with multiplicative connections. In a standard RNN, the current input $x\\_{t}$ is first transformed via the visible-to-hidden weight matrix $W\\_{hx}$ and then contributes additively to the input for the current hidden state. An mRNN allows the current input (a character in the original example) to affect the hidden state dynamics by determining the entire hidden-to-hidden matrix (which defines the non-linear dynamics) in addition to providing an additive bias.\r\n\r\nTo achieve this goal, the authors modify the RNN so that its hidden-to-hidden weight matrix is a (learned) function of the current input $x\\_{t}$:\r\n\r\n$$ h\\_{t} = \\tanh\\left(W\\_{hx}x\\_{t} + W\\_{hh}^{\\left(x\\_{t}\\right)}h\\_{t-1} + b\\_{h}\\right)$$\r\n\r\n$$ o\\_{t} = W\\_{oh}h\\_{t} + b\\_{o} $$\r\n\r\nThis is the same as the equations for a standard RNN, except that $W\\_{hh}$ is replaced with $W\\_{hh}^{\\left(x\\_{t}\\right)}$, allowing each input (character) to specify a different hidden-to-hidden weight matrix.",
  "title": null,
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "UL2",
  "full_name": "UL2",
  "description": "**UL2** is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.",
  "title": "UL2: Unifying Language Learning Paradigms",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "GCNII",
  "full_name": "GCNII",
  "description": "**GCNII** is an extension of [Graph Convolutional Networks](https://www.paperswithcode.com/method/gcn) with two new techniques, initial residual and identity mapping, to tackle the problem of oversmoothing -- where stacking more layers and adding non-linearity tends to degrade performance. At each layer, initial residual constructs a skip connection from the input layer, while identity mapping adds an identity matrix to the weight matrix.",
  "title": "Simple and Deep Graph Convolutional Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "GEOMANCER",
  "full_name": "Geometric Manifold Component Estimator",
  "description": "**Geomancer** is a nonparametric algorithm for symmetry-based disentangling of data manifolds. It learns a set of subspaces to assign to each point in the dataset, where each subspace is the tangent space of one disentangled submanifold. This means that geomancer can be used to disentangle manifolds for which there may not be a global axis-aligned coordinate system.",
  "title": "Disentangling by Subspace Diffusion",
  "collection": "Manifold Disentangling",
  "area": "General"
}
{
  "name": "ARM-Net",
  "full_name": "ARM-Net",
  "description": "ARM-Net is an adaptive relation modeling network tailored for structured data; ARMOR is a lightweight framework based on ARM-Net for relational data analytics. The key idea is to model feature interactions with cross features selectively and dynamically, by first transforming the input features into exponential space, and then determining the interaction order and interaction weights adaptively for each cross feature. The authors propose a novel sparse attention mechanism to dynamically generate the interaction weights given the input tuple, so that cross features of arbitrary orders can be modeled explicitly, with noisy features filtered out selectively. During model inference, ARM-Net can then specify the cross features being used for each prediction for higher accuracy and better interpretability.",
  "title": "ARM-Net: Adaptive Relation Modeling Network for Structured Data",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "PanNet",
  "full_name": "Pansharpening Network",
  "description": "We propose a deep network architecture for the pansharpening problem called PanNet. We incorporate domain-specific knowledge to design our PanNet architecture by focusing on the two aims of the pan-sharpening problem: spectral and spatial preservation. For spectral preservation, we add up-sampled multispectral images to the network output, which directly propagates the spectral information to the reconstructed image. To preserve the spatial structure, we train our network parameters in the high-pass filtering domain rather than the image domain. We show that the trained network generalizes well to images from different satellites without needing retraining. Experiments show significant improvement over state-of-the-art methods visually and in terms of standard quality metrics.",
  "title": "PanNet: A Deep Network Architecture for Pan-Sharpening",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "GBO",
  "full_name": "Gradient-based optimization",
  "description": "GBO is a novel metaheuristic optimization algorithm. The GBO, inspired by the gradient-based Newton’s method, uses two main operators, the gradient search rule (GSR) and the local escaping operator (LEO), and a set of vectors to explore the search space. The GSR employs the gradient-based method to enhance the exploration tendency and accelerate the convergence rate to achieve better positions in the search space. The LEO enables the proposed GBO to escape from local optima. The performance of the new algorithm was evaluated in two phases. 28 mathematical test functions were first used to evaluate various characteristics of the GBO, and then six engineering problems were optimized by the GBO. In the first phase, the GBO was compared with five existing optimization algorithms, indicating that the GBO yielded very promising results due to its enhanced capabilities of exploration, exploitation, convergence, and effective avoidance of local optima. The second phase also demonstrated the superior performance of the GBO in solving complex real-world engineering problems. \r\n\r\n* The source code of GBO is publicly available at https://imanahmadianfar.com/codes/.",
  "title": null,
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "ScaleNet",
  "full_name": "ScaleNet",
  "description": "**ScaleNet**, or a **Scale Aggregation Network**, is a type of convolutional neural network which learns a neuron allocation for aggregating multi-scale information in different building blocks of a deep network. The most informative output neurons in each block are preserved while others are discarded, and thus neurons for multiple scales are competitively and adaptively allocated. The scale aggregation (SA) block concatenates feature maps at a wide range of scales. Feature maps for each scale are generated by a stack of downsampling, [convolution](https://paperswithcode.com/method/convolution) and upsampling operations.",
  "title": "Data-Driven Neuron Allocation for Scale Aggregation Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "BytePS",
  "full_name": "BytePS",
  "description": "**BytePS** is a distributed training method for deep neural networks. BytePS handles cases with a varying number of CPU machines and treats traditional all-reduce and parameter server (PS) as two special cases of its framework. To further accelerate DNN training, BytePS proposes Summation Service and splits a DNN optimizer into two parts: gradient summation and parameter update. It keeps the CPU-friendly part, gradient summation, on CPUs, and moves parameter update, which is more computation heavy, to GPUs.",
  "title": null,
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "ASFF",
  "full_name": "Adaptively Spatial Feature Fusion",
  "description": "**ASFF**, or **Adaptively Spatial Feature Fusion**, is a method for pyramidal feature fusion. It learns the way to spatially filter conflicting information to suppress inconsistency across different feature scales, thus improving the scale-invariance of features. \r\n\r\nASFF enables the network to directly learn how to spatially filter features at other levels so that only useful information is kept for combination. For the features at a certain level, features of other levels are first integrated and resized into the same resolution and then trained to find the optimal fusion. At each spatial location, features at different levels are fused adaptively, *i.e.*, some features may be filtered out as they carry contradictory information at this location and some may dominate with more discriminative clues. ASFF offers several advantages: (1) as the operation of searching for the optimal fusion is differentiable, it can be conveniently learned in back-propagation; (2) it is agnostic to the backbone model and it is applied to single-shot detectors that have a feature pyramid structure; and (3) its implementation is simple and the increased computational cost is marginal.\r\n\r\nLet $\\mathbf{x}\\_{ij}^{n\\rightarrow l}$ denote the feature vector at the position $(i,j)$ on the feature maps resized from level $n$ to level $l$. Following a feature resizing stage, we fuse the features at the corresponding level $l$ as follows:\r\n\r\n$$\r\n\\mathbf{y}\\_{ij}^l = \\alpha^l\\_{ij} \\cdot \\mathbf{x}\\_{ij}^{1\\rightarrow l} + \\beta^l\\_{ij} \\cdot \\mathbf{x}\\_{ij}^{2\\rightarrow l} +\\gamma^l\\_{ij} \\cdot \\mathbf{x}\\_{ij}^{3\\rightarrow l},\r\n$$\r\n\r\nwhere $\\mathbf{y}\\_{ij}^l$ denotes the $(i,j)$-th vector of the output feature maps $\\mathbf{y}^l$ among channels. $\\alpha^l\\_{ij}$, $\\beta^l\\_{ij}$ and $\\gamma^l\\_{ij}$ refer to the spatial importance weights for the feature maps at three different levels to level $l$, which are adaptively learned by the network. Note that $\\alpha^l\\_{ij}$, $\\beta^l\\_{ij}$ and $\\gamma^l\\_{ij}$ can be simple scalar variables, which are shared across all the channels. Inspired by acnet, we force $\\alpha^l\\_{ij}+\\beta^l\\_{ij}+\\gamma^l\\_{ij}=1$ and $\\alpha^l\\_{ij},\\beta^l\\_{ij},\\gamma^l\\_{ij} \\in [0,1]$, and \r\n\r\n$$\r\n\\alpha^l\\_{ij} = \\frac{e^{\\lambda^l\\_{\\alpha\\_{ij}}}}{e^{\\lambda^l\\_{\\alpha\\_{ij}}} + e^{\\lambda^l\\_{\\beta\\_{ij}}} + e^{\\lambda^l\\_{\\gamma\\_{ij}}}}.\r\n$$\r\n\r\nHere $\\alpha^l\\_{ij}$, $\\beta^l\\_{ij}$ and $\\gamma^l\\_{ij}$ are defined by using the [softmax](https://paperswithcode.com/method/softmax) function with $\\lambda^l\\_{\\alpha\\_{ij}}$, $\\lambda^l\\_{\\beta\\_{ij}}$ and $\\lambda^l\\_{\\gamma\\_{ij}}$ as control parameters respectively. We use $1\\times1$ [convolution](https://paperswithcode.com/method/convolution) layers to compute the weight scalar maps $\\mathbf{\\lambda}^l\\_\\alpha$, $\\mathbf{\\lambda}^l\\_\\beta$ and $\\mathbf{\\lambda}^l\\_\\gamma$ from $\\mathbf{x}^{1\\rightarrow l}$, $\\mathbf{x}^{2\\rightarrow l}$ and $\\mathbf{x}^{3\\rightarrow l}$ respectively, and they can thus be learned through standard back-propagation.\r\n\r\nWith this method, the features at all the levels are adaptively aggregated at each scale. The outputs are used for object detection following the same pipeline as [YOLOv3](https://paperswithcode.com/method/yolov3).",
  "title": "Learning Spatial Fusion for Single-Shot Object Detection",
  "collection": "Feature Pyramid Blocks",
  "area": "Computer Vision"
}
{
  "name": "CPVT",
  "full_name": "Conditional Position Encoding Vision Transformer",
  "description": "**CPVT**, or **Conditional Position Encoding Vision Transformer**, is a type of [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) which utilizes [conditional positional encoding](https://paperswithcode.com/method/conditional-positional-encoding). Other than the new encodings, it follows the same architecture of [ViT](https://paperswithcode.com/method/vision-transformer) and [DeiT](https://paperswithcode.com/method/deit).",
  "title": "Conditional Positional Encodings for Vision Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "NoisyNet-Dueling",
  "full_name": "NoisyNet-Dueling",
  "description": "**NoisyNet-Dueling** is a modification of a [Dueling Network](https://paperswithcode.com/method/dueling-network) that utilises noisy linear layers for exploration instead of $\\epsilon$-greedy exploration as in the original Dueling formulation.",
  "title": "Noisy Networks for Exploration",
  "collection": "Q-Learning Networks",
  "area": "Reinforcement Learning"
}
{
  "name": "GShard",
  "full_name": "GShard",
  "description": "**GShard** is an intra-layer parallel distributed method. It consists of a set of simple APIs for annotations, and a compiler extension in XLA for automatic parallelization.",
  "title": "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "TAPAS",
  "full_name": "TAPAS",
  "description": "**TAPAS** is a weakly supervised question answering model that reasons over tables without generating logical forms. TAPAS predicts a minimal program by selecting a subset of the table cells and a possible aggregation operation to be executed on top of them. Consequently, TAPAS can learn operations from natural language, without the need to specify them in some formalism. This is implemented by extending [BERT](https://paperswithcode.com/method/bert)’s architecture with additional embeddings that capture tabular structure, and with two classification layers for selecting cells and predicting a corresponding aggregation operator.",
  "title": "TAPAS: Weakly Supervised Table Parsing via Pre-training",
  "collection": "Table Question Answering Models",
  "area": "Natural Language Processing"
}
{
  "name": "FiLM Module",
  "full_name": "FiLM Module",
  "description": "The **Feature-wise linear modulation** (**FiLM**) module combines information from both noisy waveform and input mel-spectrogram. It is used in the [WaveGrad](https://paperswithcode.com/method/wavegrad) model. The authors also added iteration index $n$ which indicates the noise level of the input waveform by using the [Transformer](https://paperswithcode.com/method/transformer) sinusoidal positional embedding. To condition on the noise level directly, $n$ is replaced by $\\sqrt{\\bar{\\alpha}}$ and a linear scale $C = 5000$ is applied. The FiLM module produces both scale and bias vectors given inputs, which are used in a UBlock for feature-wise affine transformation as:\r\n\r\n$$ \\gamma\\left(D, \\sqrt{\\bar{\\alpha}}\\right) \\odot U + \\zeta\\left(D, \\sqrt{\\bar{\\alpha}}\\right) $$\r\n\r\nwhere $\\gamma$ and $\\zeta$ correspond to the scaling and shift vectors from the FiLM module, $D$ is the output from corresponding [DBlock](https://paperswithcode.com/method/dblock), $U$ is an intermediate output in the UBlock.",
  "title": "WaveGrad: Estimating Gradients for Waveform Generation",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "RFE",
  "full_name": "Rank Flow Embedding",
  "description": "",
  "title": null,
  "collection": "Image Retrieval Models",
  "area": "Computer Vision"
}
{
  "name": "QPT",
  "full_name": "Quantum Process Tomography",
  "description": "",
  "title": "0/1 Constrained Optimization Solving Sample Average Approximation for Chance Constrained Programming",
  "collection": "Value Function Estimation",
  "area": "Reinforcement Learning"
}
{
  "name": "TabNN",
  "full_name": "TabNN",
  "description": "TabNN is a universal neural network solution to derive effective NN architectures for tabular data in all kinds of tasks automatically. Specifically, the design of TabNN follows two principles: to explicitly leverage expressive feature combinations and to reduce model complexity. Since GBDT has empirically proven its strength in modeling tabular data, GBDT is used to power the implementation of TabNN.",
  "title": "TabNN: A Universal Neural Network Solution for Tabular Data",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "MobileDet",
  "full_name": "MobileDet",
  "description": "**MobileDet** is an object detection model developed for mobile accelerators. MobileDets uses regular convolutions extensively on EdgeTPUs and DSPs, especially in the early stage of the network where depthwise convolutions tend to be less efficient.  This helps boost the latency-accuracy trade-off for object detection on accelerators, provided that they are placed strategically in the network via [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). By incorporating regular convolutions in the search space and directly optimizing the network architectures for object detection, an efficient family of object detection models is obtained.",
  "title": "MobileDets: Searching for Object Detection Architectures for Mobile Accelerators",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "AutoInt",
  "full_name": "AutoInt",
  "description": "**AutoInt** is a deep tabular learning method that models high-order feature interactions of input features. AutoInt can be applied to both numerical and categorical input features. Specifically, both the numerical and categorical features are mapped into the same low-dimensional space. Afterwards, a multi-head self-attentive neural network with residual connections is proposed to explicitly model the feature interactions in the low-dimensional space. With different layers of the multi-head self-attentive neural networks, different orders of feature combinations of input features can be modeled.",
  "title": "AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "PDC",
  "full_name": "Prime Dilated Convolution",
  "description": "",
  "title": "Sound2Synth: Interpreting Sound via FM Synthesizer Parameters Estimation",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "DyGED",
  "full_name": "Dynamic Graph Event Detection",
  "description": "",
  "title": "Event Detection on Dynamic Graphs",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "AdaGPR",
  "full_name": "AdaGPR",
  "description": "**AdaGPR** is an adaptive, layer-wise graph [convolution](https://paperswithcode.com/method/convolution) model. AdaGPR applies adaptive generalized Pageranks at each layer of a [GCNII](https://paperswithcode.com/method/gcnii) model by learning to predict the coefficients of generalized Pageranks using sparse solvers.",
  "title": "Layer-wise Adaptive Graph Convolution Networks Using Generalized Pagerank",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "3DIS",
  "full_name": "3-dimensional interaction space",
  "description": "A **trainable 3D interaction space** aims to captures the associations between the triplet components and helps model the recognition of multiple triplets in the same frame.\r\n\r\nSource: [Nwoye et al.](https://arxiv.org/pdf/2007.05405v1.pdf)\r\n\r\nImage source: [Nwoye et al.](https://arxiv.org/pdf/2007.05405v1.pdf)",
  "title": "Recognition of Instrument-Tissue Interactions in Endoscopic Videos via Action Triplets",
  "collection": "3D Representations",
  "area": "Computer Vision"
}
{
  "name": "MATE",
  "full_name": "MATE",
  "description": "**MATE** is a [Transformer](https://paperswithcode.com/method/transformer) architecture designed to model the structure of web tables. It uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. Each attention head reorders the tokens by either column or row index and then applies a windowed attention mechanism. Unlike traditional self-attention, Mate scales linearly in the sequence length.",
  "title": "MATE: Multi-view Attention for Table Transformer Efficiency",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Primer",
  "full_name": "Primer",
  "description": "**Primer** is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based architecture that improves upon the [Transformer](https://paperswithcode.com/method/transformer) architecture with two improvements found through [neural architecture search](https://paperswithcode.com/methods/category/neural-architecture-search): [squared RELU activations](https://paperswithcode.com/method/squared-relu) in the feedforward block, and [depthwise convolutions]() added to the attention multi-head projections: resulting in a new module called [Multi-DConv-Head-Attention](https://paperswithcode.com/method/multi-dconv-head-attention).",
  "title": "Primer: Searching for Efficient Transformers for Language Modeling",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "ComiRec",
  "full_name": "ComiRec",
  "description": "**ComiRec** is a multi-interest framework for sequential recommendation. The multi-interest module captures multiple interests from user behavior sequences, which can be exploited for retrieving candidate items from the large-scale item pool. These items are then fed into an aggregation module to obtain the overall recommendation. The aggregation module leverages a controllable factor to balance the recommendation accuracy and diversity.",
  "title": "Controllable Multi-Interest Framework for Recommendation",
  "collection": "Recommendation Systems",
  "area": "General"
}
{
  "name": "Weight Decay",
  "full_name": "Weight Decay",
  "description": "**Weight Decay**, or **$L_{2}$ Regularization**, is a regularization technique applied to the weights of a neural network. We minimize a loss function compromising both the primary loss function and a penalty on the $L\\_{2}$ Norm of the weights:\r\n\r\n$$L\\_{new}\\left(w\\right) = L\\_{original}\\left(w\\right) + \\lambda{w^{T}w}$$\r\n\r\nwhere $\\lambda$ is a value determining the strength of the penalty (encouraging smaller weights). \r\n\r\nWeight decay can be incorporated directly into the weight update rule, rather than just implicitly by defining it through to objective function. Often weight decay refers to the implementation where we specify it directly in the weight update rule (whereas L2 regularization is usually the implementation which is specified in the objective function).\r\n\r\nImage Source: Deep Learning, Goodfellow et al",
  "title": null,
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Hardtanh Activation",
  "full_name": "Hardtanh Activation",
  "description": "**Hardtanh** is an activation function used for neural networks:\r\n\r\n$$ f\\left(x\\right) = -1 \\text{ if } x < - 1 $$\r\n$$ f\\left(x\\right) = x \\text{ if } -1 \\leq x \\leq 1 $$\r\n$$ f\\left(x\\right) = 1 \\text{ if } x > 1 $$\r\n\r\nIt is a cheaper and more computationally efficient version of the [tanh activation](https://paperswithcode.com/method/tanh-activation).\r\n\r\nImage Source: [Zhuan Lan](https://zhuanlan.zhihu.com/p/30385380)",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Spatial Gating Unit",
  "full_name": "Spatial Gating Unit",
  "description": "**Spatial Gating Unit**, or **SGU**, is a gating unit used in the [gMLP](https://paperswithcode.com/method/gmlp) architecture to captures spatial interactions. To enable cross-token interactions, it is necessary for the layer $s(\\cdot)$ to contain a contraction operation over the spatial dimension. The layer $s(\\cdot)$ is formulated as the output of linear gating:\r\n\r\n$$\r\ns(Z)=Z \\odot f\\_{W, b}(Z)\r\n$$\r\n\r\nwhere $\\odot$ denotes element-wise multiplication. For training stability, the authors find it critical to initialize $W$ as near-zero values and $b$ as ones, meaning that $f\\_{W, b}(Z) \\approx 1$ and therefore $s(Z) \\approx Z$ at the beginning of training. This initialization ensures each [gMLP](https://paperswithcode.com/method/gmlp) block behaves like a regular [FFN](https://paperswithcode.com/method/gmlp) at the early stage of training, where each token is processed independently, and only gradually injects spatial information across tokens during the course of learning.\r\n\r\nThe authors find it further effective to split $Z$ into two independent parts $\\left(Z\\_{1}, Z\\_{2}\\right)$ along the channel dimension for the gating function and for the multiplicative bypass:\r\n\r\n$$\r\ns(Z)=Z\\_{1} \\odot f\\_{W, b}\\left(Z\\_{2}\\right)\r\n$$\r\n\r\nThey also normalize the input to $f\\_{W, b}$ which empirically improved the stability of large NLP models.",
  "title": "Pay Attention to MLPs",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "Distributional Generalization",
  "full_name": "Distributional Generalization",
  "description": "**Distributional Generalization** is a type of generalization that roughly states that outputs of a classifier at train and test time are close as distributions, as opposed to close in just their average error. This behavior is not captured by classical generalization, which would only consider the average error and not the distribution of errors over the input domain.",
  "title": "Distributional Generalization: A New Kind of Generalization",
  "collection": "Generalization",
  "area": "General"
}
{
  "name": "Softsign Activation",
  "full_name": "Softsign Activation",
  "description": "**Softsign** is an activation function for neural networks:\r\n\r\n$$ f\\left(x\\right) = \\left(\\frac{x}{|x|+1}\\right)$$\r\n\r\nImage Source: [Sefik Ilkin Serengil](https://sefiks.com/2017/11/10/softsign-as-a-neural-networks-activation-function/)",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Non-Local Operation",
  "full_name": "Non-Local Operation",
  "description": "A **Non-Local Operation** is a component for capturing long-range dependencies with deep neural networks. It is a generalization of the classical non-local mean operation in computer vision. Intuitively a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, or spacetime, implying that these operations are applicable for image, sequence, and video problems.\r\n\r\nFollowing the non-local mean operation, a generic non-local operation for deep neural networks is defined as:\r\n\r\n$$ \\mathbb{y}\\_{i} = \\frac{1}{\\mathcal{C}\\left(\\mathbb{x}\\right)}\\sum\\_{\\forall{j}}f\\left(\\mathbb{x}\\_{i}, \\mathbb{x}\\_{j}\\right)g\\left(\\mathbb{x}\\_{j}\\right) $$\r\n\r\nHere $i$ is the index of an output position (in space, time, or spacetime) whose response is to be computed and $j$ is the index that enumerates all possible positions. x is the input signal (image, sequence, video; often their features) and $y$ is the output signal of the same size as $x$. A pairwise function $f$ computes a scalar (representing relationship such as affinity) between $i$ and all $j$. The unary function $g$ computes a representation of the input signal at the position $j$. The\r\nresponse is normalized by a factor $C\\left(x\\right)$.\r\n\r\nThe non-local behavior is due to the fact that all positions ($\\forall{j}$) are considered in the operation. As a comparison, a convolutional operation sums up the weighted input in a local neighborhood (e.g., $i − 1 \\leq j \\leq i + 1$ in a 1D case with kernel size 3), and a recurrent operation at time $i$ is often based only on the current and the latest time steps (e.g., $j = i$ or $i − 1$).\r\n\r\nThe non-local operation is also different from a fully-connected (fc) layer. The equation above computes responses based on relationships between different locations, whereas fc uses learned weights. 
In other words, the relationship between $x\\_{j}$ and $x\\_{i}$ is not a function of the input data in fc, unlike in non-local layers. Furthermore, the formulation in the equation above supports inputs of variable sizes, and maintains the corresponding size in the output. On the contrary, an fc layer requires a fixed-size input/output and loses positional correspondence (e.g., that from $x\\_{i}$ to $y\\_{i}$ at the position $i$).\r\n\r\nA non-local operation is a flexible building block and can be easily used together with convolutional/recurrent layers. It can be added into the earlier part of deep neural networks, unlike fc layers that are often used at the end. This allows us to build a richer hierarchy that combines both non-local and local information.\r\n\r\nIn terms of parameterisation, we usually parameterise $g$ as a linear embedding of the form $g\\left(x\\_{j}\\right) = W\\_{g}\\mathbb{x}\\_{j}$, where $W\\_{g}$ is a weight matrix to be learned. This is implemented as, e.g., 1×1 [convolution](https://paperswithcode.com/method/convolution) in space or 1×1×1 convolution in spacetime. For $f$ we use an affinity function, a list of which can be found [here](https://paperswithcode.com/methods/category/affinity-functions).",
  "title": "Non-local Neural Networks",
  "collection": "Image Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "SRGAN",
  "full_name": "SRGAN",
  "description": "**SRGAN** is a generative adversarial network for single image super-resolution. It uses a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes the solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, the authors use a content loss motivated by perceptual similarity instead of similarity in pixel space. The actual networks - depicted in the Figure to the right - consist mainly of residual blocks for feature extraction.\r\n\r\nFormally we write the perceptual loss function as a weighted sum of a ([VGG](https://paperswithcode.com/method/vgg)) content loss $l^{SR}\\_{X}$ and an adversarial loss component $l^{SR}\\_{Gen}$:\r\n\r\n$$ l^{SR} = l^{SR}\\_{X} + 10^{-3}l^{SR}\\_{Gen} $$",
  "title": "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "ALQ and AMQ",
  "full_name": "Gradient Quantization with Adaptive Levels/Multiplier",
  "description": "Many communication-efficient variants of [SGD](https://paperswithcode.com/method/sgd) use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during the training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are also significantly more robust to the choice of hyperparameters.",
  "title": "Adaptive Gradient Quantization for Data-Parallel SGD",
  "collection": "Data Parallel Methods",
  "area": "General"
}
{
  "name": "CPM-2",
  "full_name": "CPM-2",
  "description": "**CPM-2** is a 11 billion parameters pre-trained language model based on a standard Transformer architecture consisting of a bidirectional encoder and a unidirectional decoder. The model is pre-trained on WuDaoCorpus which contains 2.3TB cleaned Chinese data as well as 300GB cleaned English data. The pre-training process of CPM-2 can be divided into three stages: Chinese pre-training, bilingual pre-training, and MoE pre-training. Multi-stage training with knowledge inheritance can significantly reduce the computation cost.",
  "title": "CPM-2: Large-scale Cost-effective Pre-trained Language Models",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "Laplacian Pyramid",
  "full_name": "Laplacian Pyramid",
  "description": "A **Laplacian Pyramid** is a linear invertible image representation consisting of a set of band-pass\r\nimages spaced an octave apart, plus a low-frequency residual. Formally, let $d\\left(.\\right)$ be a downsampling operation that blurs and decimates a $j \\times j$ image $I$ so that $d\\left(I\\right)$ is a new image of size $\\frac{j}{2} \\times \\frac{j}{2}$. Also, let $u\\left(.\\right)$ be an upsampling operator which smooths and expands $I$ to be twice the size, so $u\\left(I\\right)$ is a new image of size $2j \\times 2j$. We first build a Gaussian pyramid $G\\left(I\\right) = \\left[I\\_{0}, I\\_{1}, \\dots, I\\_{K}\\right]$, where\r\n$I\\_{0} = I$ and $I\\_{k}$ is $k$ repeated application of $d\\left(.\\right)$ to $I$. $K$ is the number of levels in the pyramid selected so that the final level has a minimal spatial extent ($\\leq 8 \\times 8$ pixels).\r\n\r\nThe coefficients $h\\_{k}$ at each level $k$ of the Laplacian pyramid $L\\left(I\\right)$ are constructed by taking the difference between adjacent levels in the Gaussian pyramid, upsampling the smaller one with $u\\left(.\\right)$ so that the sizes are compatible:\r\n\r\n$$ h\\_{k} = \\mathcal{L}\\_{k}\\left(I\\right) = G\\_{k}\\left(I\\right) − u\\left(G\\_{k+1}\\left(I\\right)\\right) = I\\_{k} − u\\left(I\\_{k+1}\\right) $$\r\n\r\nIntuitively, each level captures the image structure present at a particular scale. The final level of the\r\nLaplacian pyramid $h\\_{K}$ is not a difference image, but a low-frequency residual equal to the final\r\nGaussian pyramid level, i.e. $h\\_{K} = I\\_{K}$. Reconstruction from a Laplacian pyramid coefficients\r\n$\\left[h\\_{1}, \\dots, h\\_{K}\\right]$ is performed using the backward recurrence:\r\n\r\n$$ I\\_{k} = u\\left(I\\_{k+1}\\right) + h\\_{k} $$\r\n\r\nwhich is started with $I\\_{K} = h\\_{K}$ and the reconstructed image being $I = I\\_{o}$. 
In other words, starting at the coarsest level, we repeatedly upsample and add the difference image $h$ at the next finer level until we return to the full-resolution image.\r\n\r\nSource: [LAPGAN](https://paperswithcode.com/method/lapgan)\r\n\r\nImage source: [Design of FIR Filters for Fast Multiscale Directional Filter Banks](https://www.researchgate.net/figure/Relationship-between-Gaussian-and-Laplacian-Pyramids_fig2_275038450)",
  "title": null,
  "collection": "Image Representations",
  "area": "Computer Vision"
}
{
  "name": "RandWire",
  "full_name": "RandWire",
  "description": "**RandWire** is a type of convolutional neural network that arise from randomly\r\nwired neural networks that are sampled from stochastic network generators, in which a human-designed random\r\nprocess defines generation.",
  "title": "Exploring Randomly Wired Neural Networks for Image Recognition",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "CoBERL",
  "full_name": "Contrastive BERT",
  "description": "**Contrastive BERT** is a reinforcement learning agent that combines a new contrastive loss and a hybrid [LSTM](https://paperswithcode.com/method/lstm)-[transformer](https://paperswithcode.com/method/transformer) architecture to tackle the challenge of improving data efficiency for RL. It uses bidirectional masked prediction in combination with a generalization of recent contrastive methods to learn better representations for transformers in RL, without the need of hand engineered data augmentations.\r\n\r\nFor the architecture, a residual network is used to encode observations into embeddings $Y\\_{t}$. $Y_{t}$  is fed through a causally masked [GTrXL transformer](https://www.paperswithcode.com/method/gtrxl), which computes the predicted masked inputs $X\\_{t}$ and passes those together with $Y\\_{t}$ to a learnt gate. The output of the gate is passed through a single [LSTM](https://www.paperswithcode.com/method/lstm) layer to produce the values that we use for computing the RL loss. A contrastive loss is computed using predicted masked inputs $X_{t}$ and $Y_{t}$ as targets. For this, we do not use the causal mask of the Transformer.",
  "title": "CoBERL: Contrastive BERT for Reinforcement Learning",
  "collection": "RL Transformers",
  "area": "Reinforcement Learning"
}
{
  "name": "RBPN",
  "full_name": "Recurrent Back Projection Network",
  "description": "",
  "title": "Recurrent Back-Projection Network for Video Super-Resolution",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "GALA",
  "full_name": "Global-and-Local attention",
  "description": "Most attention mechanisms learn where to focus using only weak supervisory signals from class labels, which inspired Linsley et al. to investigate how explicit human supervision can affect the performance and interpretability of attention models. As a proof of concept, Linsley et al. proposed the global-and-local attention (GALA) module, which extends an SE block with a spatial attention mechanism.\r\n\r\nGiven the input feature map $X$, GALA uses an attention mask that combines global and local attention to tell the network where and on what to focus. As in SE blocks, global attention aggregates global information by global average pooling and then produces a channel-wise attention weight vector using a multilayer perceptron. In local attention, two consecutive $1\\times 1$ convolutions are conducted on the input to produce a positional weight map. The outputs of the local and global pathways are combined by addition and multiplication. Formally, GALA can be represented as:\r\n\\begin{align}\r\n    s_g &= W_{2} \\delta (W_{1}\\text{GAP}(x))\r\n\\end{align}\r\n\r\n\\begin{align}\r\n    s_l &= Conv_2^{1\\times 1} (\\delta(Conv_1^{1\\times1}(X)))\r\n\\end{align}\r\n\r\n\\begin{align}\r\n    s_g^* &= \\text{Expand}(s_g)\r\n\\end{align}\r\n\r\n\\begin{align}\r\n    s_l^* &= \\text{Expand}(s_l) \r\n\\end{align}\r\n\r\n\\begin{align}\r\n    s &= \\tanh(a(s_g^\\* + s_l^\\*) +m \\cdot (s_g^\\* s_l^\\*) )\r\n\\end{align}\r\n\r\n\\begin{align}\r\n    Y &= sX\r\n\\end{align}\r\n\r\nwhere $a,m \\in \\mathbb{R}^{C}$ are learnable parameters representing channel-wise weight vectors. \r\n\r\nSupervised by human-provided feature importance maps, GALA has significantly improved representational power and can be combined with any CNN backbone.",
  "title": "Learning what and where to attend",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "MDTVSFA",
  "full_name": "MDTVSFA",
  "description": "",
  "title": "Unified Quality Assessment of In-the-Wild Videos with Mixed Datasets Training",
  "collection": "Video Quality Models",
  "area": "Computer Vision"
}
{
  "name": "Global Context Block",
  "full_name": "Global Context Block",
  "description": "A **Global Context Block** is an image model block for global context modeling. The aim is to have both the benefits of the simplified [non-local block](https://paperswithcode.com/method/non-local-block) with effective modeling of long-range dependencies, and the [squeeze-excitation block](https://paperswithcode.com/method/squeeze-and-excitation-block) with lightweight computation. \r\n\r\nIn the Global Context framework, we have (a) global attention pooling, which adopts a [1x1 convolution](https://paperswithcode.com/method/1x1-convolution) $W_{k}$ and [softmax](https://paperswithcode.com/method/softmax) function to obtain the attention weights, and then performs the attention pooling to obtain the global context features, (b) feature transform via a 1x1 [convolution](https://paperswithcode.com/method/convolution) $W\\_{v}$; (c) feature aggregation, which employs addition to aggregate the global context features to the features of each position. Taken as a whole, the GC block is proposed as a lightweight way to achieve global context modeling.",
  "title": "GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Softmax",
  "full_name": "Softmax",
  "description": "The **Softmax** output function transforms a previous layer's output into a vector of probabilities. It is commonly used for multiclass classification.  Given an input vector $x$ and a weighting vector $w$ we have:\r\n\r\n$$ P(y=j \\mid{x}) = \\frac{e^{x^{T}w_{j}}}{\\sum^{K}_{k=1}e^{x^{T}wk}} $$",
  "title": null,
  "collection": "Output Functions",
  "area": "General"
}
{
  "name": "Pattern-Exploiting Training",
  "full_name": "Pattern-Exploiting Training",
  "description": "**Pattern-Exploiting Training** is a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task. These phrases are then used to assign soft labels to a large set of unlabeled examples. Finally, standard supervised training is performed on the resulting training set. \r\n\r\nIn the case of PET for sentiment classification, first a number of patterns encoding some form of task description are created to convert training examples to cloze questions; for each pattern, a pretrained language model is finetuned. Secondly, the ensemble of trained models annotates unlabeled data. Lastly, a classifier is trained on the resulting soft-labeled dataset.",
  "title": "Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "Temporally Consistent Spatial Augmentation",
  "full_name": "Temporally Consistent Spatial Augmentation",
  "description": "**Temporally Consistent Spatial Augmentation** is a video data augmentation technique used for contrastive learning in the [Contrastive Video Representation Learning](https://paperswithcode.com/method/cvrl) framework. It fixes the randomness of spatial augmentation across frames; this prevents spatial augmentation hurting learning if applied independently across frames, because in that case it breaks the natural motion. In contrast, having temporally consistent spatial augmentation does not break the natural motion in the frames.",
  "title": "Spatiotemporal Contrastive Video Representation Learning",
  "collection": "Video Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "MPSO",
  "full_name": "Motion-Encoded Particle Swarm Optimization",
  "description": "",
  "title": "Motion-Encoded Particle Swarm Optimization for Moving Target Search Using UAVs",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "DELG",
  "full_name": "DELG",
  "description": "**DELG** is a convolutional neural network for image retrieval that combines generalized mean pooling for global features and attentive selection for local features. The entire network can be learned end-to-end by carefully balancing the gradient flow between two heads – requiring only image-level labels. This allows for efficient inference by extracting an image’s global feature, detected keypoints and local descriptors within a single model.\r\n\r\nThe model is enabled by leveraging hierarchical image representations that arise in [CNNs](https://paperswithcode.com/methods/category/convolutional-neural-networks), which are coupled to [generalized mean pooling](https://paperswithcode.com/method/generalized-mean-pooling) and attentive local feature detection. Secondly, a convolutional autoencoder module is adopted that can successfully learn low-dimensional local descriptors. This can be readily integrated into the unified model, and avoids the need of post-processing learning steps, such as [PCA](https://paperswithcode.com/method/pca), that are commonly used. Finally, a procedure is used that enables end-to-end training of the proposed model using only image-level supervision. This requires carefully controlling the gradient flow between the global and local network heads during backpropagation, to avoid disrupting the desired representations.",
  "title": "Unifying Deep Local and Global Features for Image Search",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Pixel-BERT",
  "full_name": "Pixel-BERT",
  "description": "Pixel-BERT is a pre-trained model trained to align image pixels with text. The end-to-end framework includes a CNN-based visual encoder and cross-modal transformers for visual and language embedding learning.\r\nThis model has three parts: one fully convolutional neural network that takes pixels of an image as input, one word-level token embedding based on BERT, and a multimodal transformer for jointly learning visual and language embedding.\r\n\r\nFor language, it uses other pretraining works to use Masked Language Modeling (MLM) to predict masked tokens with surrounding text and images. For vision, it uses the random pixel sampling mechanism that makes up for the challenge of predicting pixel-level features. This mechanism is also suitable for solving overfitting issues and improving the robustness of visual features. \r\n\r\nIt applies Image-Text Matching (ITM) to classify whether an image and a sentence pair match for vision and language interaction. \r\n\r\nImage captioning is required to understand language and visual semantics for cross-modality tasks like VQA. Region-based visual features extracted from object detection models like Faster RCNN are used for better performance in the newer version of the model.",
  "title": "Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Mixture of Softmaxes",
  "full_name": "Mixture of Softmaxes",
  "description": "**Mixture of Softmaxes** performs $K$ different softmaxes and mixes them. The motivation is that the traditional [softmax](https://paperswithcode.com/method/softmax) suffers from a softmax bottleneck, i.e. the expressiveness of the conditional probability we can model is constrained by the combination of a dot product and the softmax. By using a mixture of softmaxes, we can model the conditional probability more expressively.",
  "title": "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model",
  "collection": "Output Functions",
  "area": "General"
}
{
  "name": "NICE",
  "full_name": "Non-linear Independent Component Estimation",
  "description": "**NICE**, or **Non-Linear Independent Components Estimation** is a framework for modeling complex high-dimensional densities. It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables.  The transformation is parameterised so that computing the determinant of the Jacobian and inverse Jacobian is trivial, yet it maintains the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood. The transformation used in NICE is the [affine coupling](https://paperswithcode.com/method/affine-coupling) layer without the scale term, known as additive coupling layer:\r\n\r\n$$ y\\_{I\\_{2}} = x\\_{I\\_{2}} + m\\left(x\\_{I\\_{1}}\\right) $$\r\n\r\n$$ x\\_{I\\_{2}} = y\\_{I\\_{2}} + m\\left(y\\_{I\\_{1}}\\right) $$",
  "title": "NICE: Non-linear Independent Components Estimation",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "MnasNet",
  "full_name": "MnasNet",
  "description": "**MnasNet** is a type of convolutional neural network optimized for mobile devices that is discovered through mobile [neural architecture search](https://paperswithcode.com/method/neural-architecture-search), which explicitly incorporates model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. The main building block is an [inverted residual block](https://paperswithcode.com/method/inverted-residual-block) (from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2)).",
  "title": "MnasNet: Platform-Aware Neural Architecture Search for Mobile",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Deep-CAPTCHA",
  "full_name": "Deep-CAPTCHA",
  "description": "",
  "title": "Deep-CAPTCHA: a deep learning based CAPTCHA solver for vulnerability assessment",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Factorized Random Synthesized Attention",
  "full_name": "Factorized Random Synthesized Attention",
  "description": "**Factorized Random Synthesized Attention**, introduced with the [Synthesizer](https://paperswithcode.com/method/synthesizer) architecture, is similar to [factorized dense synthesized attention](https://paperswithcode.com/method/factorized-dense-synthesized-attention) but for random synthesizers. Letting $R$ being a randomly initialized matrix, we factorize $R$ into low rank matrices $R\\_{1}, R\\_{2} \\in \\mathbb{R}^{l\\text{ x}k}$ in the attention function:\r\n\r\n$$ Y = \\text{Softmax}\\left(R\\_{1}R\\_{2}^{T}\\right)G\\left(X\\right) . $$\r\n\r\nHere $G\\left(.\\right)$ is a parameterized function that is equivalent to $V$ in [Scaled Dot-Product Attention](https://paperswithcode.com/method/scaled).\r\n\r\nFor each head, the factorization reduces the parameter costs from $l^{2}$ to $2\\left(lk\\right)$ where\r\n$k << l$ and hence helps prevent overfitting. In practice, we use a small value of $k = 8$.\r\n\r\nThe basic idea of a  Random Synthesizer is to not rely on pairwise token interactions or any information from individual token but rather to learn a task-specific alignment that works well globally across many samples.",
  "title": "Synthesizer: Rethinking Self-Attention in Transformer Models",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Sinkhorn Transformer",
  "full_name": "Sinkhorn Transformer",
  "description": "The **Sinkhorn Transformer** is a type of [transformer](https://paperswithcode.com/method/transformer) that uses [Sparse Sinkhorn Attention](https://paperswithcode.com/method/sparse-sinkhorn-attention) as a building block. This component is a plug-in replacement for dense fully-connected attention (as well as local attention, and sparse attention alternatives), and allows for reduced memory complexity as well as sparse attention.",
  "title": "Sparse Sinkhorn Attention",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "PixelShuffle",
  "full_name": "PixelShuffle",
  "description": "**PixelShuffle** is an operation used in super-resolution models to implement efficient sub-pixel convolutions with a stride of $1/r$. Specifically it rearranges elements in a tensor of shape $(\\*, C \\times r^2, H, W)$ to a tensor of shape $(\\*, C, H \\times r, W \\times r)$.\r\n\r\nImage Source: [Remote Sensing Single-Image Resolution Improvement Using A Deep Gradient-Aware Network with Image-Specific Enhancement](https://www.researchgate.net/figure/The-pixel-shuffle-layer-transforms-feature-maps-from-the-LR-domain-to-the-HR-image_fig3_339531308)",
  "title": "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network",
  "collection": "Miscellaneous Components",
  "area": "General"
}
{
  "name": "UNIMO",
  "full_name": "UNIMO",
  "description": "**UNIMO** is a multi-modal pre-training architecture that can effectively adapt to both single modal and multimodal understanding and generation tasks. UNIMO learns visual representations and textual representations simultaneously, and unifies them into the same semantic space via [cross-modal contrastive learning](https://paperswithcode.com/method/cmcl) (CMCL) based on a large-scale corpus of image collections, text corpus and image-text pairs. The CMCL aligns the visual representation and textual representation, and unifies them into the same semantic\r\nspace based on image-text pairs.",
  "title": "UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning",
  "collection": "Multi-Modal Methods",
  "area": "Computer Vision"
}
{
  "name": "Accuracy-Robustness Area (ARA)",
  "full_name": "Accuracy-Robustness Area",
  "description": "In the space of adversarial perturbation against classifier accuracy, the ARA is the area between a classifier's curve and the straight line defined by a naive classifier's maximum accuracy. Intuitively, the ARA measures a combination of the classifier’s predictive power and its ability to overcome an adversary. Importantly, when contrasted against existing robustness metrics, the ARA takes into account the classifier’s performance against all adversarial examples, without  bounding them by some arbitrary $\\epsilon$.",
  "title": "Adversarial Explanations for Understanding Image Classification Decisions and Improved Neural Network Robustness",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "Tofu",
  "full_name": "Tofu",
  "description": "**Tofu** is an intra-layer model parallel system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators used by platforms like MXNet and TensorFlow. To optimally partition different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost.",
  "title": "Supporting Very Large Models using Automatic Dataflow Graph Partitioning",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "Slanted Triangular Learning Rates",
  "full_name": "Slanted Triangular Learning Rates",
  "description": "**Slanted Triangular Learning Rates (STLR)** is a learning rate schedule which first linearly increases the learning rate and then linearly decays it, which can be seen in Figure to the right. It is a modification of Triangular Learning Rates, with a short increase and a long decay period.",
  "title": "Universal Language Model Fine-tuning for Text Classification",
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "Characteristic Functions",
  "full_name": "Characteristic Function Estimation for Discrete Probability Distributions",
  "description": "",
  "title": "Applications of the discrete-time Fourier transform to data analysis",
  "collection": "Fourier-related Transforms",
  "area": "General"
}
{
  "name": "Visual Attention",
  "full_name": "Visual Attention",
  "description": "",
  "title": "InferNER: an attentive model leveraging the sentence-level information for Named Entity Recognition in Microblogs",
  "collection": "Attention",
  "area": "General"
}
{
  "name": "RoIAlign",
  "full_name": "RoIAlign",
  "description": "**Region of Interest Align**, or **RoIAlign**, is an operation for extracting a small feature map from each RoI in detection and segmentation based tasks. It removes the harsh quantization of [RoI Pool](https://paperswithcode.com/method/roi-pooling), properly *aligning* the extracted features with the input. To avoid any quantization of the RoI boundaries or bins (using $x/16$ instead of $[x/16]$), RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the result is then aggregated (using max or average).",
  "title": "Mask R-CNN",
  "collection": "RoI Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "EfficientDet",
  "full_name": "EfficientDet",
  "description": "**EfficientDet** is a type of object detection model, which utilizes several optimization and backbone tweaks, such as the use of a [BiFPN](https://paperswithcode.com/method/bifpn), and a compound scaling method that uniformly scales the resolution,depth and width for all backbones, feature networks and box/class prediction networks at the same time.",
  "title": "EfficientDet: Scalable and Efficient Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "HITNet",
  "full_name": "HITNet",
  "description": "**HITNet** is a framework for neural network based depth estimation which overcomes the computational disadvantages of operating on a 3D volume by integrating image warping, spatial propagation and a fast high resolution initialization step into the network architecture, while keeping the flexibility of a learned representation by allowing features to flow through the network. The main idea of the approach is to represent image tiles as planar patches which have a learned compact feature descriptor attached to them. The basic principle of the approach is to fuse information from the high resolution initialization and the current hypotheses using spatial propagation. The propagation is implemented via a [convolutional neural network](https://paperswithcode.com/methods/category/convolutional-neural-networks) module that updates the estimate of the planar patches and their attached features. \r\n\r\nIn order for the network to iteratively increase the accuracy of the disparity predictions, the network is provided a local cost volume in a narrow band (±1 disparity) around the planar patch using in-network image warping allowing the network to minimize image dissimilarity. To reconstruct fine details while also capturing large texture-less areas we start at low resolution and hierarchically upsample predictions to higher resolution. A critical feature of the architecture is that at each resolution, matches from the initialization module are provided to facilitate recovery of thin structures that cannot be represented at low resolution.",
  "title": "HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching",
  "collection": "Stereo Depth Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "RSU Beneš Block",
  "full_name": "Beneš Block with Residual Switch Units",
  "description": "The **Beneš block** is a computation-efficient alternative to dense attention, enabling the modelling of long-range dependencies in O(n log n) time. In comparison, dense attention which is commonly used in Transformers has O(n^2) complexity.\r\n\r\nIn music, dependencies occur on several scales, including on a coarse scale which requires processing very long sequences. Beneš blocks have been used in Residual Shuffle-Exchange Networks to achieve state-of-the-art results in music transcription.\r\n\r\nBeneš blocks have a ‘receptive field’ of the size of the whole sequence, and it has no bottleneck. These properties hold for dense attention but have not been shown for many sparse attention and dilated convolutional architectures.",
  "title": "Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "RAG",
  "full_name": "RAG",
  "description": "**Retriever-Augmented Generation**, or **RAG**, is a type of language generation model that combines pre-trained parametric and non-parametric memory for language generation. Specifically, the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.  For query $x$, Maximum Inner Product Search (MIPS) is used to find the top-K documents $z\\_{i}$. For final prediction $y$, we treat $z$ as a latent variable and marginalize over seq2seq predictions given different documents.",
  "title": "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "DDSP",
  "full_name": "Differentiable Digital Signal Processing",
  "description": "",
  "title": "DDSP: Differentiable Digital Signal Processing",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "Agglomerative Contextual Decomposition",
  "full_name": "Agglomerative Contextual Decomposition",
  "description": "**Agglomerative Contextual Decomposition (ACD)** is an interpretability method that produces hierarchical interpretations for a single prediction made by a neural network, by scoring interactions and building them into a tree. Given a prediction from a trained neural network, ACD produces a hierarchical clustering of the input features, along with the contribution of each cluster to the final prediction. This hierarchy is optimized to identify clusters of features that the DNN learned are predictive.",
  "title": "Hierarchical interpretations for neural network predictions",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "PyTorch DDP",
  "full_name": "PyTorch DDP",
  "description": "**PyTorch DDP** (Distributed Data Parallel) is a distributed data parallel implementation for PyTorch. To guarantee mathematical equivalence, all replicas start from the same initial values for model parameters and synchronize gradients to keep parameters consistent across training iterations. To minimize the intrusiveness, the implementation exposes the same forward API as the user model, allowing applications to seamlessly replace subsequent occurrences of a user model with the distributed data parallel model object with no additional code changes. Several techniques are integrated into the design to deliver high-performance training, including bucketing gradients, overlapping communication with computation, and skipping synchronization.",
  "title": "PyTorch Distributed: Experiences on Accelerating Data Parallel Training",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "Zoneout",
  "full_name": "Zoneout",
  "description": "**Zoneout** is a  method for regularizing [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks). At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like [dropout](https://paperswithcode.com/method/dropout), zoneout uses random noise to train a pseudo-ensemble, improving generalization.\r\nBut by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward [stochastic depth](https://paperswithcode.com/method/stochastic-depth) networks.",
  "title": "Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "SAGA",
  "full_name": "SAGA",
  "description": "SAGA is a method in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem.",
  "title": "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "Self-Attention Guidance",
  "full_name": "Self-Attention Guidance",
  "description": "",
  "title": "Improving Sample Quality of Diffusion Models Using Self-Attention Guidance",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "RGA",
  "full_name": "Relation-aware Global Attention",
  "description": "In relation-aware global attention (RGA) stresses the importance of global structural information provided by pairwise relations, and uses it to produce attention maps. \r\n\r\nRGA comes in two forms,  spatial RGA (RGA-S) and channel RGA (RGA-C). RGA-S first reshapes the input feature map $X$ to $C\\times (H\\times W)$ and the pairwise relation matrix $R \\in \\mathbb{R}^{(H\\times W)\\times (H\\times W)}$ is computed using \r\n\\begin{align}\r\n    Q &= \\delta(W^QX) \r\n\\end{align}\r\n\\begin{align}\r\n    K &= \\delta(W^KX) \r\n\\end{align}\r\n\\begin{align}\r\n    R &= Q^TK\r\n\\end{align}\r\nThe relation vector $r_i$ at position $i$ is defined by stacking  pairwise relations at all positions:\r\n\\begin{align}\r\n    r_i = [R(i, :); R(:,i)]    \r\n\\end{align}\r\nand the spatial relation-aware feature $y_i$ can be written as\r\n\\begin{align}\r\n    Y_i = [g^c_\\text{avg}(\\delta(W^\\varphi x_i)); \\delta(W^\\phi r_i)]\r\n\\end{align}\r\nwhere $g^c_\\text{avg}$ denotes global average pooling in the channel domain. Finally, the spatial attention score at position $i$ is given by \r\n\\begin{align}\r\n    a_i = \\sigma(W_2\\delta(W_1y_i))\r\n\\end{align}\r\nRGA-C has the same form as RGA-S, except for taking the input feature map as a set of $H\\times W$-dimensional features.\r\n\r\nRGA uses global relations to generate the attention score for each feature node,  so provides valuable structural information and significantly enhances the representational power. RGA-S and RGA-C are flexible enough to be used in any CNN network; Zhang et al. propose  using them jointly in sequence to better capture both spatial and cross-channel relationships.",
  "title": "Relation-Aware Global Attention for Person Re-identification",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ExtremeNet",
  "full_name": "ExtremeNet",
  "description": "**ExtremeNet** is a a bottom-up object detection framework that detects four extreme points (top-most, left-most, bottom-most, right-most) of an object. It uses a keypoint estimation framework to find extreme points, by predicting four multi-peak heatmaps for each object category. In addition, it uses one [heatmap](https://paperswithcode.com/method/heatmap) per category predicting the object center, as the average of two bounding box edges in both the x and y dimension. We group extreme points into objects with a purely geometry-based approach. We group four extreme points, one from each map, if and only if their\r\ngeometric center is predicted in the center heatmap with a score higher than a pre-defined threshold, We enumerate all $O\\left(n^{4}\\right)$ combinations of extreme point prediction, and select the valid ones.",
  "title": "Bottom-up Object Detection by Grouping Extreme and Center Points",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "AM",
  "full_name": "Attention Model",
  "description": "",
  "title": "Attention, Learn to Solve Routing Problems!",
  "collection": "Reinforcement Learning Frameworks",
  "area": "Reinforcement Learning"
}
{
  "name": "Wavelet Distributed Training",
  "full_name": "Wavelet Distributed Training",
  "description": "**Wavelet** is an asynchronous data parallel approach that interleaves waves of training tasks on the same group of GPUs, such that tasks belong to one wave can leverage on-device memory from tasks in another wave during their memory valley period, thus boost-up the training throughput. As shown in the Figure, Wavelet divides dataparallel training tasks into two waves, namely tick-wave and tock-wave. The task launching offset is achieved by delaying the launch time of tock-wave tasks for half of a whole forward-backward training cycle. Therefore, the tock-wave tasks can directly leverage GPU memory valley period of tick-wave tasks (e.g. 0.4s-0.6s in Figure 2(a)), since backward propagation of tick-wave tasks is compute-heavy but memory is often unused. Similarly, tick-wave tasks can leverage memory valley period of tock-wave tasks in the same way.",
  "title": null,
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "DynamicConv",
  "full_name": "Dynamic Convolution",
  "description": "**DynamicConv** is a type of [convolution](https://paperswithcode.com/method/convolution) for sequential modelling where it has kernels that vary over time as a learned function of the individual time steps. It builds upon [LightConv](https://paperswithcode.com/method/lightconv) and takes the same form but uses a time-step dependent kernel:\r\n\r\n$$ \\text{DynamicConv}\\left(X, i, c\\right) = \\text{LightConv}\\left(X, f\\left(X\\_{i}\\right)\\_{h,:}, i, c\\right) $$",
  "title": "Pay Less Attention with Lightweight and Dynamic Convolutions",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Chimera",
  "full_name": "Chimera",
  "description": "**Chimera** is a pipeline model parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. The key idea of Chimera is to combine two pipelines in different directions (down and up pipelines). \r\n\r\nDenote $N$ as the number of micro-batches executed by each worker within a training iteration, and $D$ the number of pipeline stages (depth), and $P$ the number of workers.\r\n\r\nThe Figure shows an example with four pipeline stages (i.e. $D=4$). Here we assume there are $D$ micro-batches executed by each worker within a training iteration, namely $N=D$, which is the minimum to keep all the stages active. \r\n\r\nIn the down pipeline, stage$\\_{0}$∼stage$\\_{3}$ are mapped to $P\\_{0}∼P\\_{3}$ linearly, while in the up pipeline the stages are mapped in a completely opposite order. The $N$ (assuming an even number) micro-batches are equally partitioned among the two pipelines. Each pipeline schedules $N/2$ micro-batches using 1F1B strategy, as shown in the left part of the Figure. Then, by merging these two pipelines together, we obtain the pipeline schedule of Chimera. Given an even number of stages $D$ (which can be easily satisfied in practice), it is guaranteed that there is no conflict (i.e., there is at most one micro-batch occupies the same time slot on each worker) during merging.",
  "title": "Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines",
  "collection": "Model Parallel Methods",
  "area": "General"
}
{
  "name": "SpatialDropout",
  "full_name": "SpatialDropout",
  "description": "**SpatialDropout** is a type of [dropout](https://paperswithcode.com/method/dropout) for convolutional networks. For a given [convolution](https://paperswithcode.com/method/convolution) feature tensor of size $n\\_{\\text{feats}}$×height×width, we perform only $n\\_{\\text{feats}}$ dropout\r\ntrials and extend the dropout value across the entire feature map. Therefore, adjacent pixels in the dropped-out feature\r\nmap are either all 0 (dropped-out) or all active as illustrated in the figure to the right.",
  "title": "Efficient Object Localization Using Convolutional Networks",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "uPIT",
  "full_name": "utterance level permutation invariant training",
  "description": "",
  "title": "Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "ALIS",
  "full_name": "Aligning Latent and Image Spaces",
  "description": "An infinite image generator which is based on a patch-wise, periodically equivariant generator.",
  "title": "Aligning Latent and Image Spaces to Connect the Unconnectable",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "Universal Transformer",
  "full_name": "Universal Transformer",
  "description": "The **Universal Transformer** is a generalization of the [Transformer](https://paperswithcode.com/method/transformer) architecture. Universal Transformers combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks). They also utilise a dynamic per-position halting mechanism.",
  "title": "Universal Transformers",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "SongNet",
  "full_name": "SongNet",
  "description": "**SongNet** is an auto-regressive [Transformer](https://paperswithcode.com/method/transformer)-based language model for rigid format text detection. Sets of symbols are tailor-designed to improve the modeling performance especially on format, rhyme, and sentence integrity. The attention mechanism is improved to impel the model to capture some future information on the format. A pre-training and fine-tuning framework is designed to further improve the generation quality.",
  "title": "SongNet: Rigid Formats Controlled Text Generation",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Sandwich Transformer",
  "full_name": "Sandwich Transformer",
  "description": "A **Sandwich Transformer** is a variant of a [Transformer](https://paperswithcode.com/method/transformer) that reorders sublayers in the architecture to achieve better performance. The reordering is based on the authors' analysis that models with more self-attention toward the bottom and more\r\nfeedforward sublayers toward the top tend to perform better in general.",
  "title": "Improving Transformer Models by Reordering their Sublayers",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "CAG",
  "full_name": "Class activation guide",
  "description": "Class activation guide is a module which uses weak localization information from the instrument activation maps to guide the verb and target recognition. \r\n\r\nImage source: [Nwoye et al.](https://arxiv.org/pdf/2007.05405v1.pdf)",
  "title": "Recognition of Instrument-Tissue Interactions in Endoscopic Videos via Action Triplets",
  "collection": "Region Proposal",
  "area": "Computer Vision"
}
{
  "name": "AutoGL",
  "full_name": "Automated Graph Learning",
  "description": "Automated graph learning is a method that aims at discovering the best hyper-parameter and neural architecture configuration for different graph tasks/data without manual design.",
  "title": "An adaptive graph learning method for automated molecular interactions and properties predictions",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "WYS",
  "full_name": "Watch Your Step",
  "description": "",
  "title": "Watch Your Step: Learning Node Embeddings via Graph Attention",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Ensemble Clustering",
  "full_name": "Ensemble Clustering",
  "description": "Ensemble clustering, also called consensus clustering, has\r\nbeen attracting much attention in recent years, aiming to combine multiple base clustering algorithms into a better and more consensus clustering. Due to its good performance, ensemble clustering plays a vital role in many research areas, such as community detection and bioinformatics.",
  "title": "Ensemble Learning for Spectral Clustering",
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "GAM",
  "full_name": "Generalized additive models",
  "description": "",
  "title": null,
  "collection": "Non-Parametric Regression",
  "area": "General"
}
{
  "name": "Noise2Fast",
  "full_name": "Noise2Fast",
  "description": "**Noise2Fast** is a model for single image blind denoising. It is similar to masking based methods -- filling in the pixel gaps -- in that the network is blind to many of the input pixels during training. The method is inspired by Neighbor2Neighbor, where the neural network learns a mapping between adjacent pixels. Noise2Fast is tuned to speed by using a discrete four image training set obtained by a form of downsampling called “checkerboard downsampling.",
  "title": "Noise2Fast: Fast Self-Supervised Single Image Blind Denoising",
  "collection": "Image Denoising Models",
  "area": "Computer Vision"
}
{
  "name": "PReLU",
  "full_name": "Parameterized ReLU",
  "description": "A **Parametric Rectified Linear Unit**, or **PReLU**, is an activation function that generalizes the traditional rectified unit with a slope for negative values. Formally:\r\n\r\n$$f\\left(y\\_{i}\\right) = y\\_{i} \\text{ if } y\\_{i} \\ge 0$$\r\n$$f\\left(y\\_{i}\\right) = a\\_{i}y\\_{i} \\text{ if } y\\_{i} \\leq 0$$\r\n\r\nThe intuition is that different layers may require different types of nonlinearity. Indeed the authors find in experiments with convolutional neural networks that PReLus for the initial layer have more positive slopes, i.e. closer to linear. Since the filters of the first layers are Gabor-like filters such as edge or texture detectors, this shows a circumstance where positive and negative responses of filters are respected. In contrast the authors find deeper layers have smaller coefficients, suggesting the model becomes more discriminative at later layers (while it wants to retain more information at earlier layers).",
  "title": "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "KGRefiner",
  "full_name": "Knowledge Graph Refiner",
  "description": "",
  "title": "KGRefiner: Knowledge Graph Refinement for Improving Accuracy of Translational Link Prediction Methods",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "FA",
  "full_name": "Feedback Alignment",
  "description": "",
  "title": "Random feedback weights support learning in deep neural networks",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "DQN",
  "full_name": "Deep Q-Network",
  "description": "A **DQN**, or Deep Q-Network, approximates a state-value function in a [Q-Learning](https://paperswithcode.com/method/q-learning) framework with a neural network. In the Atari Games case, they take in several frames of the game as an input and output state values for each action as an output. \r\n\r\nIt is usually used in conjunction with [Experience Replay](https://paperswithcode.com/method/experience-replay), for storing the episode steps in memory for off-policy learning, where samples are drawn from the replay memory at random. Additionally, the Q-Network is usually optimized towards a frozen target network that is periodically updated with the latest weights every $k$ steps (where $k$ is a hyperparameter). The latter makes training more stable by preventing short-term oscillations from a moving target. The former tackles autocorrelation that would occur from on-line learning, and having a replay memory makes the problem more like a supervised learning problem.\r\n\r\nImage Source: [here](https://www.researchgate.net/publication/319643003_Autonomous_Quadrotor_Landing_using_Deep_Reinforcement_Learning)",
  "title": "Playing Atari with Deep Reinforcement Learning",
  "collection": "Q-Learning Networks",
  "area": "Reinforcement Learning"
}
{
  "name": "Strip Pooling",
  "full_name": "Strip Pooling",
  "description": "**Strip Pooling** is a pooling strategy for scene parsing which considers a long but narrow kernel, i.e., $1\\times{N}$ or $N\\times{1}$. As an alternative to global pooling, strip pooling offers two advantages. First, it deploys a long kernel shape along one spatial dimension and hence enables capturing long-range relations of isolated regions. Second, it keeps a narrow kernel shape along the other spatial dimension, which facilitates capturing local context and prevents irrelevant regions from interfering the label prediction. Integrating such long but narrow pooling kernels enables the scene parsing networks to simultaneously aggregate both global and local context. This is essentially different from the traditional spatial pooling which collects context from a fixed square region.",
  "title": "Strip Pooling: Rethinking Spatial Pooling for Scene Parsing",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "NAFNet",
  "full_name": "Nonlinear Activation Free Network",
  "description": "",
  "title": "Simple Baselines for Image Restoration",
  "collection": "Image Restoration Models",
  "area": "Computer Vision"
}
{
  "name": "AdaSmooth",
  "full_name": "Adaptive Smooth Optimizer",
  "description": "**AdaSmooth** is a stochastic optimization technique that allows for per-dimension learning rate method for [SGD](https://paperswithcode.com/method/sgd). It is an extension of [Adagrad](https://paperswithcode.com/method/adagrad) and [AdaDelta](https://paperswithcode.com/method/adadelta) that seek to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to a fixed size $w$ while AdaSmooth adaptively selects the size of the window.\r\n\r\nGiven the window size  $M$, the effective ratio is calculated by \r\n\r\n$$e_t  = \\frac{s_t}{n_t}= \\frac{| x_t -  x_{t-M}|}{\\sum_{i=0}^{M-1} | x_{t-i} -  x_{t-1-i}|}\\\\\r\n= \\frac{| \\sum_{i=0}^{M-1} \\Delta x_{t-1-i}|}{\\sum_{i=0}^{M-1} | \\Delta x_{t-1-i}|}.$$\r\n\r\nGiven the effective ratio, the scaled smoothing constant is obtained by:\r\n\r\n$$c_t =  ( \\rho_2- \\rho_1) \\times e_t   + (1-\\rho_2),$$\r\n\r\nThe running average $E\\left[g^{2}\\right]\\_{t}$ at time step $t$ then depends only on the previous average and current gradient:\r\n\r\n$$ E\\left[g^{2}\\right]\\_{t} = c_t^2 \\odot g_{t}^2  +  \\left(1-c_t^2 \\right)\\odot E[g^2]_{t-1} $$\r\n\r\nUsually $\\rho_1$ is set to around $0.5$ and $\\rho_2$ is set to around 0.99. The update step the follows:\r\n\r\n$$ \\Delta x_t = -\\frac{\\eta}{\\sqrt{E\\left[g^{2}\\right]\\_{t} + \\epsilon}} \\odot  g_{t}, $$\r\n\r\nwhich is incorporated into the final update:\r\n\r\n$$x_{t+1} = x_{t} + \\Delta x_t.$$\r\n\r\nThe main advantage of AdaSmooth is its faster convergence rate and insensitivity to hyperparameters.",
  "title": "AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Griffin-Lim Algorithm",
  "full_name": "Griffin-Lim Algorithm",
  "description": "The **Griffin-Lim Algorithm (GLA)** is a phase reconstruction method based on the redundancy of the short-time Fourier transform. It promotes the consistency of a spectrogram by iterating two projections, where a spectrogram is said to be consistent when its inter-bin dependency owing to the redundancy of STFT is retained.  GLA is based only on the consistency and does not take any prior knowledge about the target signal into account. \r\n\r\nThis algorithm expects to recover a complex-valued spectrogram, which is consistent and maintains the given amplitude $\\mathbf{A}$, by the following alternative projection procedure:\r\n\r\n$$ \\mathbf{X}^{[m+1]} = P\\_{\\mathcal{C}}\\left(P\\_{\\mathcal{A}}\\left(\\mathbf{X}^{[m]}\\right)\\right) $$\r\n\r\nwhere $\\mathbf{X}$ is a complex-valued spectrogram updated through the iteration, $P\\_{\\mathcal{S}}$ is the metric projection onto a set $\\mathcal{S}$, and $m$ is the iteration index. Here, $\\mathcal{C}$ is the set of consistent spectrograms, and $\\mathcal{A}$ is the set of spectrograms whose amplitude is the same as the given one. The metric projections onto these sets $\\mathcal{C}$ and $\\mathcal{A}$ are given by:\r\n\r\n$$ P\\_{\\mathcal{C}}(\\mathbf{X}) = \\mathcal{GG}^{†}\\mathbf{X} $$\r\n$$ P\\_{\\mathcal{A}}(\\mathbf{X}) = \\mathbf{A} \\odot \\mathbf{X} \\oslash |\\mathbf{X}| $$\r\n\r\n\r\nwhere $\\mathcal{G}$ represents STFT, $\\mathcal{G}^{†}$ is the pseudo inverse of STFT (iSTFT), $\\odot$ and $\\oslash$ are element-wise multiplication and division, respectively, and division by zero is replaced by zero. GLA is obtained as an algorithm for the following optimization problem:\r\n\r\n$$ \\min\\_{\\mathbf{X}} || \\mathbf{X} - P\\_{\\mathcal{C}}\\left(\\mathbf{X}\\right) ||^{2}\\_{\\text{Fro}} \\text{ s.t. } \\mathbf{X} \\in \\mathcal{A} $$\r\n\r\nwhere $ || · ||\\_{\\text{Fro}}$ is the Frobenius norm. 
This equation minimizes the energy of the inconsistent components under the constraint on amplitude which must be equal to the given one. Although GLA has been widely utilized because of its simplicity, GLA often involves many iterations until it converges to a certain spectrogram and results in low reconstruction quality. This is because the cost function only requires the consistency, and the characteristics of the target signal are not taken into account.",
  "title": null,
  "collection": "Phase Reconstruction",
  "area": "Audio"
}
{
  "name": "VGAE",
  "full_name": "Variational Graph Auto Encoder",
  "description": "",
  "title": "Variational Graph Auto-Encoders",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "DSAM loss",
  "full_name": "Distance Shrinking with Angular Marginalizing Loss",
  "description": "",
  "title": "DSAM: A Distance Shrinking with Angular Marginalizing Loss for High Performance Vehicle Re-identificatio",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Compressed Memory",
  "full_name": "Compressed Memory",
  "description": "**Compressed Memory** is a secondary FIFO memory component proposed as part of the [Compressive Transformer](https://paperswithcode.com/method/compressive-transformer) model. The Compressive [Transformer](https://paperswithcode.com/method/transformer) keeps a fine-grained memory of past activations, which are then compressed into coarser compressed memories. \r\n\r\nFor choices of compression functions $f\\_{c}$ the authors consider (1) max/mean pooling, where the kernel and stride is set to the compression rate $c$; (2) 1D [convolution](https://paperswithcode.com/method/convolution) also with kernel & stride set to $c$; (3) dilated convolutions; (4) *most-used* where the memories are sorted by their average attention (usage) and the most-used are preserved.",
  "title": "Compressive Transformers for Long-Range Sequence Modelling",
  "collection": "Miscellaneous Components",
  "area": "General"
}
{
  "name": "FastSGT",
  "full_name": "FastSGT",
  "description": "**Fast Schema Guided Tracker**, or **FastSGT**, is a fast and robust [BERT](https://paperswithcode.com/method/bert)-based model for state tracking in goal-oriented dialogue systems. The model employs carry-over mechanisms for transferring the values between slots, enabling switching between services and accepting the values offered by the system during dialogue. It also uses [multi-head attention](https://paperswithcode.com/method/multi-head-attention) projections in some of the decoders to have a better modelling of the encoder outputs.\r\n\r\nThe model architecture is illustrated in the Figure. It consists of four main modules: 1-Utterance Encoder, 2-Schema Encoder, 3-State Decoder, and 4-State Tracker. The first three modules constitute the NLU component and are based on neural networks, whereas the state tracker is a rule-based module. [BERT](https://paperswithcode.com/method/bert) was used for both encoders in the model.\r\n\r\nThe Utterance Encoder is a BERT model which encodes the user and system utterances at each turn. The Schema Encoder is also a BERT model which encodes the schema descriptions of intents, slots, and values into schema embeddings. These schema embeddings help the decoders to transfer or share knowledge between different services by having some language understanding of each slot, intent, or value. The schema and utterance embeddings are passed to the State Decoder - a multi-task module. This module consists of five sub-modules producing the information necessary to track the state of the dialogue. Finally, the State Tracker module takes the previous state along with the current outputs of the State Decoder and predicts the current state of the dialogue by aggregating and summarizing the information across turns.",
  "title": "A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset",
  "collection": "Dialogue State Trackers",
  "area": "Natural Language Processing"
}
{
  "name": "Deep Ensembles",
  "full_name": "Deep Ensembles",
  "description": "",
  "title": "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "TopK Copy",
  "full_name": "TopK Copy",
  "description": "**TopK Copy** is a cross-attention guided copy mechanism for entity extraction where only the Top-$k$ important attention heads are used for computing copy distributions. The motivation is that that attention heads may not equally important, and that some heads can be pruned out with a marginal decrease in overall performance. Attention probabilities produced by insignificant attention heads may be noisy. Thus, computing copy distributions without these heads could improve the model’s ability to infer the importance of each token in the input document.",
  "title": "Document-level Entity-based Extraction as Template Generation",
  "collection": "Copy Mechanisms",
  "area": "Natural Language Processing"
}
{
  "name": "SlowMo",
  "full_name": "SlowMo",
  "description": "**Slow Momentum** (SlowMo) is a distributed optimization method where workers periodically synchronize and perform a momentum update, after multiple iterations of a base optimization algorithm.  Periodically, after taking some number $\\tau$ of base algorithm steps, workers average their parameters using ALLREDUCE and perform a momentum update.",
  "title": "SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "CInC Flow",
  "full_name": "Characterizable Invertible 3x3 Convolution",
  "description": "Characterizable Invertible $3\\times3$  Convolution",
  "title": "CInC Flow: Characterizable Invertible 3x3 Convolution",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "DGI",
  "full_name": "Deep Graph Infomax",
  "description": "Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs—both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups.\r\n\r\nDescription and image from: [DEEP GRAPH INFOMAX](https://arxiv.org/pdf/1809.10341.pdf)",
  "title": "Deep Graph Infomax",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "CTRL",
  "full_name": "CTRL",
  "description": "**CTRL** is conditional [transformer](https://paperswithcode.com/method/transformer) language model, trained\r\nto condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw\r\ntext, preserving the advantages of unsupervised learning while providing more\r\nexplicit control over text generation. These codes also allow CTRL to predict\r\nwhich parts of the training data are most likely given a sequence",
  "title": "CTRL: A Conditional Transformer Language Model for Controllable Generation",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Span-Based Dynamic Convolution",
  "full_name": "Span-Based Dynamic Convolution",
  "description": "**Span-Based Dynamic Convolution** is a type of convolution used in the [ConvBERT](https://paperswithcode.com/method/convbert) architecture to capture local dependencies between tokens.  Kernels are generated by taking in a local span of current token, which better utilizes local dependency and discriminates different meanings of the same token (e.g., if “a” is in front of “can” in the input sentence, “can” is apparently a noun not a verb).\r\n\r\nSpecifically, with [classic convolution](https://paperswithcode.com/method/convolution), we would have fixed parameters shared for all input tokens. [Dynamic convolution](https://paperswithcode.com/method/dynamicconv) is therefore preferable because it has  higher flexibility in capturing local dependencies of different tokens. Dynamic convolution uses a kernel generator to produce different kernels for different input tokens. However, such dynamic convolution cannot differentiate the same tokens within different context and\r\ngenerate the same kernels (e.g., the three “can” in Figure (b)).\r\n\r\nTherefore the span-based dynamic convolution is developed to produce more adaptive convolution kernels by receiving an input span instead of only a single token, which enables discrimination of generated kernels for the same tokens within different context. For example, as shown in Figure (c), span-based dynamic convolution produces different kernels for different “can” tokens.",
  "title": "ConvBERT: Improving BERT with Span-based Dynamic Convolution",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "PEGASUS",
  "full_name": "PEGASUS",
  "description": "**PEGASUS** proposes a transformer-based model for abstractive summarization. It uses a special self-supervised pre-training objective called gap-sentences generation (GSG) that's designed to perform well on summarization-related downstream tasks. As reported in the paper, \"both GSG and MLM are applied simultaneously to this example as pre-training objectives. Originally there are three sentences. One sentence is masked with [MASK1] and used as target generation text (GSG). The other two sentences remain in the input, but some tokens are randomly masked by [MASK2].\"",
  "title": "PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Causal inference",
  "full_name": "Causal inference",
  "description": "Causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect. The main difference between causal inference and inference of association is that the former analyzes the response of the effect variable when the cause is changed.",
  "title": null,
  "collection": null,
  "area": null
}
{
  "name": "Q-Learning",
  "full_name": "Q-Learning",
  "description": "**Q-Learning** is an off-policy temporal difference control algorithm:\r\n\r\n$$Q\\left(S\\_{t}, A\\_{t}\\right) \\leftarrow Q\\left(S\\_{t}, A\\_{t}\\right) + \\alpha\\left[R_{t+1} + \\gamma\\max\\_{a}Q\\left(S\\_{t+1}, a\\right) - Q\\left(S\\_{t}, A\\_{t}\\right)\\right] $$\r\n\r\nThe learned action-value function $Q$ directly approximates $q\\_{*}$, the optimal action-value function, independent of the policy being followed.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition",
  "title": null,
  "collection": "Off-Policy TD Control",
  "area": "Reinforcement Learning"
}
{
  "name": "3DSSD",
  "full_name": "3DSSD",
  "description": "**3DSSD** is a point-based 3D single stage object detection detector. In this paradigm, all upsampling layers and refinement stage, which are indispensable in all existing point-based methods, are abandoned to reduce the large computation cost. The authors propose a fusion sampling strategy in the downsampling process to make detection on less representative points feasible. A delicate box prediction network including a candidate generation layer, an anchor-free regression head with a 3D center-ness assignment strategy is designed to meet the needs of accuracy and speed.",
  "title": "3DSSD: Point-based 3D Single Stage Object Detector",
  "collection": "3D Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "TraDeS",
  "full_name": "TraDeS",
  "description": "**TradeS** is an online joint detection and tracking model, coined as TRACK to DEtect and Segment, exploiting tracking clues to assist detection end-to-end. TraDeS infers object tracking offset by a cost volume, which is used to propagate previous object features for improving current object detection and segmentation.",
  "title": "Track to Detect and Segment: An Online Multi-Object Tracker",
  "collection": "Multi-Object Tracking Models",
  "area": "Computer Vision"
}
{
  "name": "Population Based Training",
  "full_name": "Population Based Training",
  "description": "**Population Based Training**, or **PBT**, is an optimization method for finding parameters and hyperparameters, and extends upon parallel search methods and sequential optimisation methods.\r\nIt leverages information sharing across a population of concurrently running optimisation processes, and allows for online propagation/transfer of parameters and hyperparameters between members of the population based on their performance. Furthermore, unlike most other adaptation schemes, the method is capable of performing online adaptation of hyperparameters -- which can be particularly important in problems with highly non-stationary learning dynamics, such as reinforcement learning settings. PBT is decentralised and asynchronous, although it could also be executed semi-serially or with partial synchrony if there is a binding budget constraint.",
  "title": "Population Based Training of Neural Networks",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "HBMP",
  "full_name": "Hierarchical BiLSTM Max Pooling",
  "description": "HBMP is a hierarchy-like structure of [BiLSTM](https://paperswithcode.com/method/bilstm) layers with [max pooling](https://paperswithcode.com/method/max-pooling). All in all, this model improves the previous state of the art for SciTail and achieves strong results for the SNLI and MultiNLI.",
  "title": "Sentence Embeddings in NLI with Iterative Refinement Encoders",
  "collection": "Sequence To Sequence Models",
  "area": "Sequential"
}
{
  "name": "SAFRAN",
  "full_name": "SAFRAN - Scalable and fast non-redundant rule application",
  "description": "SAFRAN is a rule application framework which aggregates rules through a scalable clustering algorithm.",
  "title": "Scalable and interpretable rule-based link prediction for large heterogeneous knowledge graphs",
  "collection": "Rule-based systems",
  "area": "General"
}
{
  "name": "STN",
  "full_name": "spatial transformer networks",
  "description": "spatial transformer networks uses an explicit procedure to learn invariance to translation, scaling, rotation and other more general warps, making the network pay attention to the most relevant regions. STN was the first attention mechanism to explicitly predict important regions and provide a deep neural network with transformation invariance.\r\n\r\nTaking a 2D image as an example, a 2D affine transformation can be formulated as followed, where A denotes a $ 2 \\times 3 $ learneable affine matrix:\r\n\r\n\\begin{align}\r\nA = f_\\text{loc}(U) \r\n\\end{align}\r\n\\begin{align}\r\nx_i^s = A x_i^t\r\n\\end{align}\r\n\r\nHere, $U$ is the input feature map, and $f_\\text{loc}$ can be any differentiable function, such as a lightweight fully-connected network or convolutional neural network. $x_{i}^{s}$  is coordinates in the output feature map, while $x_{i}^{t}$ is corresponding coordinates in the input feature map and the $ A $ matrix is the learnable affine matrix. After obtaining the correspondence, the network can sample relevant input regions using the correspondence. \r\nTo ensure that the whole process is differentiable and can be updated in an end-to-end manner,  bilinear sampling is used to sample the input features.\r\n\r\nSTNs focus on discriminative regions automatically\r\nand  learn invariance to some geometric transformations.",
  "title": "Spatial Transformer Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Procrustes",
  "full_name": "Procrustes",
  "description": "Procrustes",
  "title": "Procrustes registration of two-dimensional statistical shape models without correspondences",
  "collection": "Generalized Linear Models",
  "area": "General"
}
{
  "name": "SAM",
  "full_name": "Segment Anything Model",
  "description": "",
  "title": "Segment Anything",
  "collection": "Image Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Make-A-Scene",
  "full_name": "Make-A-Scene",
  "description": "Make-A-Scene is a text-to-image method that (i) enables a simple control mechanism complementary to text in the form of a scene, (ii) introduces elements that improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapts classifier-free guidance for the transformer use case.",
  "title": "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors",
  "collection": "Image Generation Models",
  "area": "Computer Vision"
}
{
  "name": "EMEA",
  "full_name": "Entropy Minimized Ensemble of Adapters",
  "description": "**Entropy Minimized Ensemble of Adapters**, or **EMEA**, is a method that optimizes the ensemble weights of the pretrained language adapters for each test sentence by minimizing the entropy of its predictions. The intuition behind the method is that a good [adapter](https://paperswithcode.com/method/adapter) weight $\\alpha$ for a test input $x$ should make the model more confident in its prediction for $x$, that is, it should lead to lower model entropy over the input",
  "title": "Efficient Test Time Adapter Ensembling for Low-resource Language Varieties",
  "collection": "Ensembling",
  "area": "General"
}
{
  "name": "Weight Tying",
  "full_name": "Weight Tying",
  "description": "**Weight Tying** improves the performance of language models by tying (sharing) the weights of the embedding and [softmax](https://paperswithcode.com/method/softmax) layers. This method also massively reduces the total number of parameters in the language models that it is applied to. \r\n\r\nLanguage models are typically comprised of an embedding layer, followed by a number of [Transformer](https://paperswithcode.com/method/transformer) or [LSTM](https://paperswithcode.com/method/lstm) layers, which are finally followed by a softmax layer. Embedding layers learn word representations, such that similar words (in meaning) are represented by vectors that are near each other (in cosine distance). [Press & Wolf, 2016] showed that the softmax matrix, in which every word also has a vector representation, also exhibits this property. This leads them to propose to share the softmax and embedding matrices, which is done today in nearly all language models.  \r\n\r\nThis method was independently introduced by [Press & Wolf, 2016](https://paperswithcode.com/paper/using-the-output-embedding-to-improve) and [Inan et al, 2016](https://paperswithcode.com/paper/tying-word-vectors-and-word-classifiers-a).\r\n\r\nAdditionally, the Press & Wolf paper proposes Three-way Weight Tying, a method for NMT models in which the embedding matrix for the source language, the embedding matrix for the target language, and the softmax matrix for the target language are all tied. That method has been adopted by the Attention Is All You Need model and many other neural machine translation models.",
  "title": "Using the Output Embedding to Improve Language Models",
  "collection": "Parameter Sharing",
  "area": "General"
}
{
  "name": "CPE",
  "full_name": "Collaborative Preference Embedding",
  "description": "CPE is an effective collaborative metric learning to effectively address the problem of sparse and insufficient preference supervision from the margin distribution point-of-view.",
  "title": "Collaborative Preference Embedding against Sparse Labels",
  "collection": "Recommendation Systems",
  "area": "General"
}
{
  "name": "k-Means Clustering",
  "full_name": "k-Means Clustering",
  "description": "**k-Means Clustering** is a clustering algorithm that divides a training set into $k$ different clusters of examples that are near each other. It works by initializing $k$ different centroids {$\\mu\\left(1\\right),\\ldots,\\mu\\left(k\\right)$} to different values, then alternating between two steps until convergence:\r\n\r\n(i) each training example is assigned to cluster $i$ where $i$ is the index of the nearest centroid $\\mu^{(i)}$\r\n\r\n(ii) each centroid $\\mu^{(i)}$ is updated to the mean of all training examples $x^{(j)}$ assigned to cluster $i$.\r\n\r\nText Source: Deep Learning, Goodfellow et al\r\n\r\nImage Source: [scikit-learn](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html)",
  "title": null,
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "Convolution",
  "full_name": "Convolution",
  "description": "A **convolution** is a type of matrix operation, consisting of a kernel, a small matrix of weights, that slides over input data performing element-wise multiplication with the part of the input it is on, then summing the results into an output.\r\n\r\nIntuitively, a convolution allows for weight sharing - reducing the number of effective parameters - and image translation (allowing for the same feature to be detected in different parts of the input space).\r\n\r\nImage Source: [https://arxiv.org/pdf/1603.07285.pdf](https://arxiv.org/pdf/1603.07285.pdf)",
  "title": null,
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "NAS-FPN",
  "full_name": "NAS-FPN",
  "description": "**NAS-FPN** is a Feature Pyramid Network that is discovered via [Neural Architecture Search](https://paperswithcode.com/method/neural-architecture-search) in a novel scalable search space covering all cross-scale connections. The discovered architecture consists of a combination of top-down and bottom-up connections to fuse features across scales",
  "title": "NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "FastMoE",
  "full_name": "FastMoE",
  "description": "**FastMoE ** is a distributed MoE training system based on PyTorch with common accelerators. The system provides a hierarchical interface for both flexible model design and adaption to different applications, such as [Transformer-XL](https://paperswithcode.com/method/transformer-xl) and Megatron-LM.",
  "title": "FastMoE: A Fast Mixture-of-Expert Training System",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "AD-GCL",
  "full_name": "Adversarial Graph Contrastive Learning",
  "description": "",
  "title": "Adversarial Graph Augmentation to Improve Graph Contrastive Learning",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "InfoGraph",
  "full_name": "InfoGraph",
  "description": "",
  "title": "InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "TD3",
  "full_name": "Twin Delayed Deep Deterministic",
  "description": "**TD3** builds on the [DDPG](https://paperswithcode.com/method/ddpg) algorithm for reinforcement learning, with a couple of modifications aimed at tackling overestimation bias with the value function. In particular, it utilises [clipped double Q-learning](https://paperswithcode.com/method/clipped-double-q-learning), delayed update of target and policy networks, and [target policy smoothing](https://paperswithcode.com/method/target-policy-smoothing) (which is similar to a [SARSA](https://paperswithcode.com/method/sarsa) based update; a safer update, as they provide higher value to actions resistant to perturbations).",
  "title": "Addressing Function Approximation Error in Actor-Critic Methods",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "Neural Tangent Transfer",
  "full_name": "Neural Tangent Transfer",
  "description": "**Neural Tangent Transfer**, or **NTT**, is a method for finding trainable sparse networks in a label-free manner. Specifically, NTT finds sparse networks whose training dynamics, as characterized by the neural tangent kernel, mimic those of dense networks in function space.",
  "title": "Finding trainable sparse networks through Neural Tangent Transfer",
  "collection": "Sparsity",
  "area": "General"
}
{
  "name": "PresGAN",
  "full_name": "Prescribed Generative Adversarial Network",
  "description": "**Prescribed GANs** add noise to the output of a density network and optimize an entropy-regularized adversarial loss. The added noise renders tractable approximations of the predictive log-likelihood and stabilizes the training procedure. The entropy regularizer encourages PresGANs to capture all the modes of the data distribution. Fitting PresGANs involves computing the intractable gradients of the [entropy regularization](https://paperswithcode.com/method/entropy-regularization) term; PresGANs sidestep this intractability using\r\nunbiased stochastic estimates.",
  "title": "Prescribed Generative Adversarial Networks",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Spectral Clustering",
  "full_name": "Spectral Clustering",
  "description": "Spectral clustering has attracted increasing attention due to\r\nthe promising ability in dealing with nonlinearly separable datasets [15], [16]. In spectral clustering, the spectrum of the graph Laplacian is used to reveal the cluster structure. The spectral clustering algorithm mainly consists of two steps: 1) constructs the low dimensional embedded representation of the data based on the eigenvectors of the graph Laplacian, 2) applies k-means on the constructed low dimensional data to obtain the clustering result. Thus,",
  "title": "A Tutorial on Spectral Clustering",
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "OMGD",
  "full_name": "Online Multi-granularity Distillation",
  "description": "**OMGD**, or **Online Multi-Granularity Distillation** is a framework for learning efficient [GANs](https://paperswithcode.com/methods/category/generative-adversarial-networks). The student generator is optimized in a discriminator-free and ground-truth-free setting. The scheme trains the teacher and student alternatively, promoting these two generators iteratively and progressively. The progressively optimized teacher generator helps to warm up the student and guide the optimization direction step by step.\r\n\r\nSpecifically, the student generator $G\\_{S}$ only leverages the complementary teacher generators $G^{W}\\_{T}$ and $G^{D}\\_{T}$ for optimization and can be trained in the discriminator-free and ground-truth-free setting. This framework transfers different levels concepts from the intermediate layers and output layer to perform the knowledge distillation. The whole optimization is conducted on an online distillation scheme. Namely,  $G^{W}\\_{T}$, $G^{D}\\_{T}$ and $G\\_{S}$ are optimized simultaneously and progressively.",
  "title": "Online Multi-Granularity Distillation for GAN Compression",
  "collection": "Knowledge Distillation",
  "area": "General"
}
{
  "name": "AUCO ResNet",
  "full_name": "Auditory Cortex ResNet",
  "description": "The Auditory Cortex ResNet, briefly AUCO ResNet, is proposed and tested. It is a deep neural network architecture especially designed for audio classification trained end-to-end. It is inspired by the architectural organization of rat's auditory cortex, containing also innovations 2 and 3. The network outperforms the state-of-the-art accuracies on a reference audio benchmark dataset without any kind of preprocessing, imbalanced data handling and, most importantly, any kind of data augmentation.",
  "title": "AUCO ResNet: an end-to-end network for Covid-19 pre-screening from cough and breath",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "Gradual Self-Training",
  "full_name": "Gradual Self-Training",
  "description": "Gradual self-training is a method for semi-supervised domain adaptation. The goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. \r\n\r\nThis comes up for example in applications ranging from sensor networks and self-driving car perception modules to brain-machine interfaces, where machine learning systems must adapt to data distributions that evolve over time.\r\n\r\nThe gradual self-training algorithm begins with a classifier $w_0$ trained on labeled examples from the source domain (Figure a). For each successive domain $P_t$, the algorithm generates pseudolabels for unlabeled examples from that domain, and then trains a regularized supervised classifier on the pseudolabeled examples. The intuition, visualized in the Figure, is that after a single gradual shift, most examples are pseudolabeled correctly so self-training learns a good classifier on the shifted data, but the shift from the source to the target can be too large for self-training to correct.",
  "title": "Understanding Self-Training for Gradual Domain Adaptation",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "IICNet",
  "full_name": "IICNet",
  "description": "**Invertible Image Conversion Net**, or **IICNet**, is a generic framework for reversible image conversion tasks. Unlike previous encoder-decoder based methods, IICNet maintains a highly invertible structure based on invertible neural networks (INNs) to better preserve the information during conversion. It uses a relation module and a channel squeeze layer to improve the INN nonlinearity to extract cross-image relations and the network flexibility, respectively.",
  "title": "IICNet: A Generic Framework for Reversible Image Conversion",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "RGCN",
  "full_name": "Relational Graph Convolution Network",
  "description": "An **RGCN**, or **Relational Graph Convolution Network**, is an application of the [GCN framework](https://paperswithcode.com/method/gcn) to modeling relational data, specifically to link prediction and entity classification tasks.\r\n\r\nSee [here](https://docs.dgl.ai/en/0.4.x/tutorials/models/1_gnn/4_rgcn.html) for an in-depth explanation of RGCNs by DGL.",
  "title": "Modeling Relational Data with Graph Convolutional Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "MinCutPool",
  "full_name": "MinCut Pooling",
  "description": "MinCutPool is a trainable pooling operator for graphs that learns to map nodes into clusters.\r\nThe method is trained to approximate the minimum K-cut of the graph to ensure that the clusters are balanced, while also jointly optimizing the objective of the task at hand.",
  "title": "Spectral Clustering with Graph Neural Networks for Graph Pooling",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Balanced L1 Loss",
  "full_name": "Balanced L1 Loss",
  "description": "**Balanced L1 Loss** is a loss function used for the object detection task. Since [Fast R-CNN](https://paperswithcode.com/method/fast-r-cnn), classification and localization problems have been solved simultaneously under the guidance of a multi-task loss, defined as:\r\n\r\n$$ L\\_{p,u,t^{u},v} = L\\_{cls}\\left(p, u\\right) + \\lambda\\left[u \\geq 1\\right]L\\_{loc}\\left(t^{u}, v\\right) $$\r\n\r\n$L\\_{cls}$ and $L\\_{loc}$ are objective functions corresponding to recognition and localization respectively. Predictions and targets in $L\\_{cls}$ are denoted as $p$ and $u$. $t^{u}$ is the corresponding regression result for class $u$. $v$ is the regression target. $\\lambda$ is used for tuning the loss weight under multi-task learning. Samples with a loss greater than or equal to 1.0 are called outliers. The other samples are called inliers.\r\n\r\nA natural solution for balancing the involved tasks is to tune their loss weights. However, owing to the unbounded regression targets, directly raising the weight of the localization loss will make the model more sensitive to outliers. These outliers, which can be regarded as hard samples, will produce excessively large gradients that are harmful to the training process. The inliers, which can be regarded as the easy samples, contribute little gradient to the overall gradients compared with the outliers. To be more specific, inliers contribute on average only 30% of the gradient per sample compared with outliers. Considering these issues, the authors introduced the balanced L1 loss, which is denoted as $L\\_{b}$.\r\n\r\nBalanced L1 loss is derived from the conventional smooth L1 loss, in which an inflection point is set to separate inliers from outliers, and the large gradients produced by outliers are clipped at a maximum value of 1.0, as shown by the dashed lines in the Figure to the right. The key idea of balanced L1 loss is promoting the crucial regression gradients, i.e. gradients from inliers (accurate samples), to rebalance the involved samples and tasks, thus achieving a more balanced training within classification, overall localization and accurate localization. The localization loss $L\\_{loc}$ using balanced L1 loss is defined as:\r\n\r\n$$ L\\_{loc} = \\sum\\_{i \\in \\{x,y,w,h\\}}L\\_{b}\\left(t^{u}\\_{i}-v\\_{i}\\right) $$\r\n\r\nThe gradient of $L\\_{b}$ is designed as:\r\n\r\n$$ \\frac{\\partial L\\_{b}}{\\partial x} = \\alpha\\ln\\left(b|x| + 1\\right) \\text{ if } |x| < 1 $$\r\n\r\n$$ \\frac{\\partial L\\_{b}}{\\partial x} = \\gamma \\text{ otherwise } $$\r\n\r\nThe Figure to the right shows that the balanced L1 loss increases the gradients of inliers under the control of a factor denoted as $\\alpha$. A small $\\alpha$ increases more gradient for inliers, but the gradients of outliers are not influenced. Besides, an overall promotion magnification controlled by $\\gamma$ is also brought in for tuning the upper bound of regression errors, which can help the objective function better balance the involved tasks. The two factors that control different aspects are mutually enhanced to reach a more balanced training. $b$ is used to ensure $L\\_{b}\\left(x = 1\\right)$ has the same value for both formulations in the equation below.\r\n\r\nBy integrating the gradient formulation above, we can get the balanced L1 loss as:\r\n\r\n$$ L\\_{b}\\left(x\\right) = \\frac{\\alpha}{b}\\left(b|x| + 1\\right)\\ln\\left(b|x| + 1\\right) - \\alpha|x| \\text{ if } |x| < 1$$\r\n\r\n$$ L\\_{b}\\left(x\\right) = \\gamma|x| + C \\text{ otherwise } $$\r\n\r\nin which the parameters $\\gamma$, $\\alpha$, and $b$ are constrained by $\\alpha\\ln\\left(b + 1\\right) = \\gamma$. The default parameters are set as $\\alpha = 0.5$ and $\\gamma = 1.5$.",
  "title": "Libra R-CNN: Towards Balanced Learning for Object Detection",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "GoogLeNet",
  "full_name": "GoogLeNet",
  "description": "**GoogLeNet** is a type of convolutional neural network based on the [Inception](https://paperswithcode.com/method/inception-module) architecture. It utilises Inception modules, which allow the network to choose between multiple convolutional filter sizes in each block. An Inception network stacks these modules on top of each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid.",
  "title": "Going Deeper with Convolutions",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Lbl2TransformerVec",
  "full_name": "Lbl2TransformerVec",
  "description": "",
  "title": "Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches",
  "collection": "Text Classification Models",
  "area": "Natural Language Processing"
}
{
  "name": "Counterfactuals",
  "full_name": "Counterfactuals Explanations",
  "description": "",
  "title": "Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR",
  "collection": "Exploration Strategies",
  "area": "Reinforcement Learning"
}
{
  "name": "Activation Normalization",
  "full_name": "Activation Normalization",
  "description": "**Activation Normalization** is a type of normalization used for flow-based generative models; specifically it was introduced in the [GLOW](https://paperswithcode.com/method/glow) architecture. An ActNorm layer performs an affine transformation of the activations using a scale and bias parameter per channel, similar to [batch normalization](https://paperswithcode.com/method/batch-normalization). These parameters are initialized such that the post-actnorm activations per-channel have zero mean and unit variance given an initial minibatch of data. This is a form of data-dependent initialization. After initialization, the scale and bias are treated as regular trainable parameters that are independent of the data.",
  "title": "Glow: Generative Flow with Invertible 1x1 Convolutions",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "T5",
  "full_name": "T5",
  "description": "**T5**, or **Text-to-Text Transfer Transformer**, is a [Transformer](https://paperswithcode.com/method/transformer) based architecture that uses a text-to-text approach. Every task – including translation, question answering, and classification – is cast as feeding the model text as input and training it to generate some target text. This allows for the use of the same model, loss function, hyperparameters, etc. across our diverse set of tasks. The changes compared to [BERT](https://paperswithcode.com/method/bert) include:\r\n\r\n- adding a *causal* decoder to the bidirectional architecture.\r\n- replacing the fill-in-the-blank cloze task with a mix of alternative pre-training tasks.",
  "title": "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "ALP-GMM",
  "full_name": "Absolute Learning Progress and Gaussian Mixture Models for Automatic Curriculum Learning",
  "description": "ALP-GMM is an algorithm that learns to generate a learning curriculum for black box reinforcement learning agents, whereby it sequentially samples parameters controlling a stochastic procedural generation of tasks or environments.",
  "title": "Teacher algorithms for curriculum learning of Deep RL in continuously parameterized environments",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "PeleeNet",
  "full_name": "PeleeNet",
  "description": "**PeleeNet** is a convolutional neural network and object detection backbone that is a variation of [DenseNet](https://paperswithcode.com/method/densenet) with optimizations to meet a memory and computational budget. Unlike competing networks, it does not use depthwise convolutions and instead relies on regular convolutions.",
  "title": "Pelee: A Real-Time Object Detection System on Mobile Devices",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DPN Block",
  "full_name": "DPN Block",
  "description": "A **Dual Path Network** block is an image model block used in convolutional neural networks. The idea of this module is to enable sharing of common features while maintaining the flexibility to explore new features through dual path architectures. In this sense it combines the benefits of [ResNets](https://paperswithcode.com/method/resnet) and [DenseNets](https://paperswithcode.com/method/densenet). It was proposed as part of the [DPN](https://paperswithcode.com/method/dpn) CNN architecture.\r\n\r\nWe formulate such a dual path architecture as follows:\r\n\r\n$$x^{k} = \\sum\\limits\\_{t=1}^{k-1} f\\_t^{k}(h^t) \\text{,}$$\r\n\r\n$$y^{k} = \\sum\\limits\\_{t=1}^{k-1} v\\_t(h^t) = y^{k-1} + \\phi^{k-1}(y^{k-1}) \\text{,}$$\r\n\r\n$$r^{k} = x^{k} + y^{k} \\text{,}$$\r\n\r\n$$h^k = g^k \\left( r^{k} \\right) \\text{,}$$\r\n\r\nwhere $x^{k}$ and $y^{k}$ denote the information extracted at the $k$-th step from each individual path, and $v_t(\\cdot)$ is a feature learning function like $f_t^k(\\cdot)$. The first equation refers to the densely connected path that enables exploring new features. The second equation refers to the residual path that enables re-use of common features. The third equation defines the dual path that integrates them and feeds them to the last transformation function in the last equation.",
  "title": "Dual Path Networks",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "TGAN",
  "full_name": "TGAN",
  "description": "**TGAN** is a type of generative adversarial network that is capable of learning representations from an unlabeled video dataset and producing a new video. The generator consists of two sub-networks called a temporal generator and an image generator. Specifically, the temporal generator first yields a set of latent variables, each of which corresponds to a latent variable for the image generator. Then, the image generator transforms these latent variables into a video which has the same number of frames as the variables. A model composed of the temporal and image generators not only captures the time series efficiently but can also be easily extended to frame interpolation. The authors opt for a [WGAN](https://paperswithcode.com/method/wgan) as the basic [GAN](https://paperswithcode.com/method/gan) structure and objective, but use [singular value clipping](https://paperswithcode.com/method/singular-value-clipping) to enforce the Lipschitz constraint.",
  "title": "Temporal Generative Adversarial Nets with Singular Value Clipping",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "self-mem + new data",
  "full_name": "self-mem + new data",
  "description": "The training contains two steps. The first step trains on the data to create the T2D and D2T models. The second step uses these models to infer self-memory for self-training the D2T model. The training then takes the self-memory and new data to self-train the D2T model, but not the T2D model.",
  "title": "Self-training from Self-memory in Data-to-text Generation",
  "collection": "Self-Training Methods",
  "area": "General"
}
{
  "name": "GTrXL",
  "full_name": "Gated Transformer-XL",
  "description": "**Gated Transformer-XL**, or **GTrXL**, is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based architecture for reinforcement learning. It introduces architectural modifications that improve the stability and learning speed of the original Transformer and XL variant. Changes include:\r\n\r\n- Placing the [layer normalization](https://paperswithcode.com/method/layer-normalization) on only the input stream of the submodules. A key benefit to this reordering is that it now enables an identity map from the input of the transformer at the first layer to the output of the transformer after the last layer. This is in contrast to the canonical transformer, where there are a series of layer normalization operations that non-linearly transform the state encoding.\r\n- Replacing [residual connections](https://paperswithcode.com/method/residual-connection) with gating layers. The authors' experiments found that [GRUs](https://www.paperswithcode.com/method/gru) were the most effective form of gating.",
  "title": "Stabilizing Transformers for Reinforcement Learning",
  "collection": "RL Transformers",
  "area": "Reinforcement Learning"
}
{
  "name": "Neighborhood Attention",
  "full_name": "Neighborhood Attention",
  "description": "Neighborhood Attention is a restricted self attention pattern in which each token's receptive field is limited to its nearest neighboring pixels. It was proposed in [Neighborhood Attention Transformer](https://paperswithcode.com/paper/neighborhood-attention-transformer) as an alternative to other local attention mechanisms used in Hierarchical Vision Transformers.\r\n\r\nNA is in concept similar to [stand alone self attention (SASA)](https://paperswithcode.com/method/sasa), in that both can be implemented with a raster scan sliding window operation over the key value pair. However, NA would require a modification to handle corner pixels, which helps maintain a fixed receptive field size and an increased number of relative positions.\r\n\r\nThe primary challenge in experimenting with both NA and SASA has been computation. Simply extracting key values for each query is slow, takes up a large amount of memory, and is eventually intractable at scale. NA was therefore implemented through a new CUDA extension to PyTorch, [NATTEN](https://github.com/SHI-Labs/NATTEN).",
  "title": "Neighborhood Attention Transformer",
  "collection": "Attention Patterns",
  "area": "Natural Language Processing"
}
{
  "name": "Harmonic Block",
  "full_name": "Harmonic Block",
  "description": "A **Harmonic Block** is an image model component that utilizes [Discrete Cosine Transform](https://paperswithcode.com/method/discrete-cosine-transform) (DCT) filters. Convolutional neural networks (CNNs) learn filters in order to capture local correlation patterns in feature space. In contrast, DCT has preset spectral filters, which can be better for compressing information (due to the presence of redundancy in the spectral domain).\r\n\r\nDCT has been successfully used for JPEG encoding to transform image blocks into spectral representations to capture the most information with a small number of coefficients. Harmonic blocks learn how to optimally combine spectral coefficients at every layer to produce a fixed size representation defined as a weighted sum of responses to DCT filters. The use of DCT filters also makes it possible to address the task of model compression.",
  "title": "Harmonic Convolutional Networks based on Discrete Cosine Transform",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Canonical Partition",
  "full_name": "Canonical Partition",
  "description": "**Canonical partition** $\\mathcal{P}$ crops the index-restricted $d$-hop neighborhood around the center node $v_c$ from the target graph $G_t$. $\\mathcal{D}(G_t,v_i,v_c)$ denotes the shortest distance between $v_i$ and $v_c$ on $G_t$.\r\n\r\n$$ \\mathcal{P}(G_t, v_c, d) = G_c \\text{ s.t. } G_c \\subseteq G_t, \\; V_c = \\{ v_i \\in V_t \\mid \\mathcal{D}(G_t,v_i,v_c) \\leq d, \\; v_i \\leq v_c \\} $$\r\n\r\nThe graph $G_c$ obtained by canonical partition is called the *canonical neighborhood*. Canonical neighborhoods can correctly substitute the target graph in canonical count: the subgraph count of the query in the target equals the sum of the canonical counts of the query in the canonical neighborhoods of all target nodes. Canonical neighborhoods are acquired with canonical partition $\\mathcal{P}$, given any $d$ that is at least the diameter of the query.\r\n\r\n$$ \\mathcal{C}(G_q,G_t) = \\sum_{v_c \\in V_t} \\mathcal{C}_c(G_q, \\mathcal{P}(G_t, v_c, d),v_c), \\quad d \\geq \\max_{v_i, v_j \\in V_q} \\mathcal{D}(G_q, v_i, v_j) $$",
  "title": "DeSCo: Towards Generalizable and Scalable Deep Subgraph Counting",
  "collection": "Graph Data Augmentation",
  "area": "Graphs"
}
{
  "name": "Fast AutoAugment",
  "full_name": "Fast AutoAugment",
  "description": "**Fast AutoAugment** is an image data augmentation algorithm that finds effective augmentation policies via a search strategy based on density matching, motivated by Bayesian DA. The strategy is to improve the generalization performance of a given network by learning the augmentation policies which treat augmented data as missing data points of training data. However, different from Bayesian DA, the proposed method recovers those missing data points by the exploitation-and-exploration of a family of inference-time augmentations via Bayesian optimization in the policy search phase. This is realized by using an efficient density matching algorithm that does not require any back-propagation for network training for each policy evaluation.",
  "title": "Fast AutoAugment",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "PolarNet",
  "full_name": "PolarNet",
  "description": "**PolarNet** is an improved grid representation for online, single-scan LiDAR point clouds. Instead of using common spherical or bird's-eye-view projection, the polar bird's-eye-view representation balances the points across grid cells in a polar coordinate system, indirectly aligning a segmentation network's attention with the long-tailed distribution of the points along the radial axis.",
  "title": "PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation",
  "collection": "Point Cloud Representations",
  "area": "Computer Vision"
}
{
  "name": "NIMA",
  "full_name": "Neural Image Assessment",
  "description": "In the context of image enhancement, maximizing NIMA score as a prior can increase the likelihood of enhancing perceptual quality of an image.",
  "title": "NIMA: Neural Image Assessment",
  "collection": "Discriminators",
  "area": "General"
}
{
  "name": "SimAdapter",
  "full_name": "SimAdapter",
  "description": "**SimAdapter** is a module for explicitly learning knowledge from adapters. SimAdapter aims to learn the similarities between the source and target languages during fine-tuning using the adapters, and the similarity is based on an [attention mechanism](https://paperswithcode.com/methods/category/attention-mechanisms-1).\r\n\r\nThe detailed composition of the SimAdapter is shown in the Figure. By taking the language-agnostic representations from the backbone model as the query, and the language-specific outputs from multiple adapters as the keys and values, the final output of SimAdapter over attention is computed as follows (for notational simplicity, the layer index $l$ is omitted below):\r\n\r\n$$\r\n\\operatorname{SimAdapter}\\left(\\mathbf{z}, \\mathbf{a}\\_{\\left\\{S\\_{1}, S\\_{2}, \\ldots, S\\_{N}\\right\\}}\\right)=\\sum\\_{i=1}^{N} \\operatorname{Attn}\\left(\\mathbf{z}, \\mathbf{a}\\_{S\\_{i}}\\right) \\cdot\\left(\\mathbf{a}\\_{S\\_{i}} \\mathbf{W}\\_{V}\\right)\r\n$$\r\n\r\nwhere $\\operatorname{SimAdapter}(\\cdot)$ and $\\operatorname{Attn}(\\cdot)$ denote the SimAdapter and attention operations, respectively. Specifically, the attention operation is computed as:\r\n\r\n$$\r\n\\operatorname{Attn}(\\mathbf{z}, \\mathbf{a})=\\operatorname{Softmax}\\left(\\frac{\\left(\\mathbf{z} \\mathbf{W}\\_{Q}\\right)\\left(\\mathbf{a} \\mathbf{W}\\_{K}\\right)^{\\top}}{\\tau}\\right)\r\n$$\r\n\r\nwhere $\\tau$ is the temperature coefficient and $\\mathbf{W}\\_{Q}, \\mathbf{W}\\_{K}, \\mathbf{W}\\_{V}$ are attention matrices. Note that while $\\mathbf{W}\\_{Q}, \\mathbf{W}\\_{K}$ are initialized randomly, $\\mathbf{W}\\_{V}$ is initialized with a diagonal of ones and the rest of the matrix with small weights ($10^{-6}$) to retain the adapter representations. Furthermore, a regularization term is introduced to avoid drastic feature changes:\r\n\r\n$$\r\n\\mathcal{L}\\_{\\mathrm{reg}}=\\sum\\_{i, j}\\left(\\left(\\mathbf{I}\\_{V}\\right)\\_{i, j}-\\left(\\mathbf{W}\\_{V}\\right)\\_{i, j}\\right)^{2}\r\n$$\r\n\r\nwhere $\\mathbf{I}\\_{V}$ is the identity matrix with the same size as $\\mathbf{W}\\_{V}$.",
  "title": "Exploiting Adapters for Cross-lingual Low-resource Speech Recognition",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "AutoGAN",
  "full_name": "AutoGAN",
  "description": "[Neural architecture search](https://paperswithcode.com/method/neural-architecture-search) (NAS) has witnessed prevailing success in image classification and (very recently) segmentation tasks. In this paper, we present the first preliminary study on introducing the NAS algorithm to generative adversarial networks (GANs), dubbed AutoGAN. The marriage of NAS and GANs faces its unique challenges. We define the search space for the generator architectural variations and use an RNN controller to guide the search, with parameter sharing and dynamic-resetting to accelerate the process. Inception score is adopted as the reward, and a multi-level search strategy is introduced to perform NAS in a progressive way.",
  "title": "AutoGAN: Neural Architecture Search for Generative Adversarial Networks",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "SNAIL",
  "full_name": "Simple Neural Attention Meta-Learner",
  "description": "The **Simple Neural Attention Meta-Learner**, or **SNAIL**, combines the benefits of temporal convolutions and attention to solve meta-learning tasks. The authors introduce positional dependence through temporal convolutions to make the model applicable to reinforcement learning tasks, where the observations, actions, and rewards are intrinsically sequential. They also introduce attention in order to provide pinpoint access over an infinitely large context. SNAIL is constructed by combining the two: temporal convolutions produce the context over which a causal attention operation is applied.",
  "title": "A Simple Neural Attentive Meta-Learner",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Jigsaw",
  "full_name": "Jigsaw",
  "description": "**Jigsaw** is a self-supervision approach that relies on jigsaw-like puzzles as the pretext task in order to learn image representations.",
  "title": "Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Sscs",
  "full_name": "Support-set Based Cross-Supervision",
  "description": "**Sscs**, or **Support-set Based Cross-Supervision**, is a module for video grounding which consists of two main components: a discriminative contrastive objective and a generative caption objective. The contrastive objective aims to learn effective representations by contrastive learning, while the caption objective can train a powerful video encoder supervised by texts. Due to the co-existence of some visual entities in both ground-truth and background intervals, i.e., mutual exclusion, naive contrastive learning is unsuitable for video grounding. This problem is addressed by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities.\r\n\r\nSpecifically, in the Figure to the right, two video-text pairs {$V\\_{i}, L\\_{i}$}, {$V\\_{j}, L\\_{j}$} in the batch are presented for clarity. After feeding them into a video and text encoder, the clip-level and sentence-level embeddings ({$X\\_{i}, Y\\_{i}$} and {$X\\_{j}, Y\\_{j}$}) in a shared space are acquired. Based on the support-set module, the weighted average of $X\\_{i}$ and $X\\_{j}$ is computed to obtain $\\bar{X}\\_{i}$ and $\\bar{X}\\_{j}$ respectively. Finally, the contrastive and caption objectives are combined to pull close the representations of the clips and text from the same samples and push away those from other pairs.",
  "title": "Support-Set Based Cross-Supervision for Video Grounding",
  "collection": "Video Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "VideoBERT",
  "full_name": "VideoBERT",
  "description": "VideoBERT adapts the powerful [BERT](https://paperswithcode.com/method/bert) model to learn a joint visual-linguistic representation for video. It is used in numerous tasks, including action classification and video captioning.",
  "title": "VideoBERT: A Joint Model for Video and Language Representation Learning",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "H3DNet",
  "full_name": "H3DNet",
  "description": "Code for paper: H3DNet: 3D Object Detection Using Hybrid Geometric Primitives (ECCV 2020)",
  "title": null,
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "VSF",
  "full_name": "VisuoSpatial Foresight",
  "description": "**VisuoSpatial Foresight** is a method for robotic fabric manipulation that leverages a combination of RGB and depth information to learn goal conditioned fabric manipulation policies for a variety of long horizon tasks.",
  "title": "VisuoSpatial Foresight for Multi-Step, Multi-Task Fabric Manipulation",
  "collection": "Robotic Manipulation Models",
  "area": "General"
}
{
  "name": "Absolute Position Encodings",
  "full_name": "Absolute Position Encodings",
  "description": "**Absolute Position Encodings** are a type of position embeddings for [Transformer](https://paperswithcode.com/method/transformer)-based models where positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d\\_{model}$ as the embeddings, so that the two can be summed. In the original implementation, sine and cosine functions of different frequencies are used:\r\n\r\n$$ \\text{PE}\\left(pos, 2i\\right) = \\sin\\left(pos/10000^{2i/d\\_{model}}\\right) $$\r\n\r\n$$ \\text{PE}\\left(pos, 2i+1\\right) = \\cos\\left(pos/10000^{2i/d\\_{model}}\\right) $$\r\n\r\nwhere $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\\pi$ to $10000 \\cdot 2\\pi$. This function was chosen because the authors hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $\\text{PE}\\_{pos+k}$ can be represented as a linear function of $\\text{PE}\\_{pos}$.\r\n\r\nImage Source: [D2L.ai](https://d2l.ai/chapter_attention-mechanisms/self-attention-and-positional-encoding.html)",
  "title": "Attention Is All You Need",
  "collection": "Position Embeddings",
  "area": "General"
}
{
  "name": "TaLK Convolution",
  "full_name": "Time-aware Large Kernel Convolution",
  "description": "A **Time-aware Large Kernel (TaLK) convolution** is a type of temporal [convolution](https://paperswithcode.com/method/convolution) that learns the kernel size of a summation kernel for each time-step instead of learning the kernel weights as in a typical convolution operation. For each time-step, a function is responsible for predicting the appropriate size of neighbor representations to use in the form of left and right offsets relative to the time-step.",
  "title": "Time-aware Large Kernel Convolutions",
  "collection": "Temporal Convolutions",
  "area": "Sequential"
}
{
  "name": "ACGPN",
  "full_name": "Adaptive Content Generating and Preserving Network",
  "description": "**ACGPN**, or **Adaptive Content Generating and Preserving Network**, is a [generative adversarial network](https://www.paperswithcode.com/method/category/generative-adversarial-network) for virtual try-on clothing applications. \r\n\r\nIn Step I, the Semantic Generation Module (SGM) takes the target clothing image $\\mathcal{T}\\_{c}$, the pose map $\\mathcal{M}\\_{p}$, and the fused body part mask $\\mathcal{M}^{F}$ as the input to predict the semantic layout and to output the synthesized body part mask $\\mathcal{M}^{S}\\_{\\omega}$ and the target clothing mask $\\mathcal{M}^{S}\\_{c}$.\r\n\r\nIn Step II, the Clothes Warping Module (CWM) warps the target clothing image to $\\mathcal{T}^{R}\\_{c}$ according to the predicted semantic layout, where a second-order difference constraint is introduced to stabilize the warping process. \r\n\r\nIn Steps III and IV, the Content Fusion Module (CFM) first produces the composited body part mask $\\mathcal{M}^{C}\\_{\\omega}$ using the original clothing mask $\\mathcal{M}\\_{c}$, the synthesized clothing mask $\\mathcal{M}^{S}\\_{c}$, the body part mask $\\mathcal{M}\\_{\\omega}$, and the synthesized body part mask $\\mathcal{M}\\_{\\omega}^{S}$, and then exploits a fusion network to generate the try-on images $\\mathcal{I}^{S}$ by utilizing the information $\\mathcal{T}^{R}\\_{c}$, $\\mathcal{M}^{S}\\_{c}$, and the body part image $I\\_{\\omega}$ from previous steps.",
  "title": "Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "Dynamic Keypoint Head",
  "full_name": "Dynamic Keypoint Head",
  "description": "**Dynamic Keypoint Head** is an output head for pose estimation that are conditioned on each instance (person), and can encode the instance concept in the dynamically-generated weights of their filters. They are used in the [FCPose](https://paperswithcode.com/method/fcpose) architecture.\r\n\r\nThe Figure shows the core idea. $F$ denotes a level of feature maps. \"Rel. Coord.\" means the relative coordinates, denoting the relative offsets from the locations of $F$ to the location where the filters are generated. Refer to the text for details. $f\\_{\\theta\\_{i}}$ is the dynamically-generated keypoint head for the $i$-th person instance. Note that each person instance has its own keypoint head.",
  "title": "FCPose: Fully Convolutional Multi-Person Pose Estimation with Dynamic Instance-Aware Convolutions",
  "collection": "Output Heads",
  "area": "Computer Vision"
}
{
  "name": "Bayesian REX",
  "full_name": "Bayesian Reward Extrapolation",
  "description": "**Bayesian Reward Extrapolation** is a Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference.",
  "title": "Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences",
  "collection": "Bayesian Reinforcement Learning",
  "area": "Reinforcement Learning"
}
{
  "name": "ERNIE-GEN",
  "full_name": "ERNIE-GEN",
  "description": "**ERNIE-GEN** is a multi-flow sequence to sequence pre-training and fine-tuning framework which bridges the discrepancy between training and inference with an infilling generation mechanism and a noise-aware generation method. To make generation closer to human writing patterns, this framework introduces a span-by-span generation flow that trains the model to predict semantically-complete spans consecutively rather than predicting word by word. Unlike existing pre-training methods, ERNIE-GEN incorporates multi-granularity target sampling to construct pre-training data, which enhances the correlation between encoder and decoder.",
  "title": "ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "PCIDA",
  "full_name": "Probabilistic Continuously Indexed Domain Adaptation",
  "description": "**Probabilistic Continuously Indexed Domain Adaptation** (**PCIDA**) enjoys better theoretical guarantees to match both the mean and variance of the distribution $p(u|z)$. PCIDA can be extended to match higher-order moments.",
  "title": "Continuously Indexed Domain Adaptation",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "Dreamix",
  "full_name": "Dreamix: video diffusion models are general video editors",
  "description": "",
  "title": "Dreamix: Video Diffusion Models are General Video Editors",
  "collection": "Generative Video Models",
  "area": "Computer Vision"
}
{
  "name": "AMP",
  "full_name": "Adversarial Model Perturbation",
  "description": "Based on the understanding that the flat local minima of the empirical risk cause the model to generalize better. Adversarial Model Perturbation (AMP) improves generalization via minimizing the **AMP loss**, which is obtained from the empirical risk by applying the **worst** norm-bounded perturbation on each point in the parameter space.",
  "title": "Regularizing Neural Networks via Adversarial Model Perturbation",
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "Res2Net Block",
  "full_name": "Res2Net Block",
  "description": "A **Res2Net Block** is an image model block that constructs hierarchical residual-like connections\r\nwithin one single [residual block](https://paperswithcode.com/method/residual-block). It was proposed as part of the [Res2Net](https://paperswithcode.com/method/res2net) CNN architecture.\r\n\r\nThe block represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The $3 \\times 3$ filters of $n$ channels is replaced with a set of smaller filter groups, each with $w$ channels. These smaller filter groups are connected in a hierarchical residual-like style to increase the number of scales that the output features can represent. Specifically, we divide input feature maps into several groups. A group of filters first extracts features from a group of input feature maps. Output features of the previous group are then sent to the next group of filters along with another group of input feature maps. \r\n\r\nThis process repeats several times until all input feature maps are processed. Finally, feature maps from all groups are concatenated and sent to another group of $1 \\times 1$ filters to fuse information altogether. Along with any possible path in which input features are transformed to output features, the equivalent receptive field increases whenever it passes a $3 \\times 3$ filter, resulting in many equivalent feature scales due to combination effects.\r\n\r\nOne way of thinking of these blocks is that they expose a new dimension, **scale**,  alongside the existing dimensions of depth, width, and cardinality.",
  "title": "Res2Net: A New Multi-scale Backbone Architecture",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "Affine Operator",
  "full_name": "Affine Operator",
  "description": "The **Affine Operator** is an affine transformation layer introduced in the [ResMLP](https://paperswithcode.com/method/resmlp) architecture. This replaces [layer normalization](https://paperswithcode.com/method/layer-normalization), as in [Transformer based networks](https://paperswithcode.com/methods/category/transformers), which is possible since in the ResMLP, there are no [self-attention layers](https://paperswithcode.com/method/scaled) which makes training more stable - hence allowing a more simple affine transformation.\r\n\r\nThe affine operator is defined as:\r\n\r\n$$ \\operatorname{Aff}_{\\mathbf{\\alpha}, \\mathbf{\\beta}}(\\mathbf{x})=\\operatorname{Diag}(\\mathbf{\\alpha}) \\mathbf{x}+\\mathbf{\\beta} $$\r\n\r\nwhere $\\alpha$ and $\\beta$ are learnable weight vectors. This operation only rescales and shifts the input element-wise. This operation has several advantages over other normalization operations: first, as opposed to Layer Normalization, it has no cost at inference time, since it can absorbed in the adjacent linear layer. Second, as opposed to [BatchNorm](https://paperswithcode.com/method/batch-normalization) and Layer Normalization, the Aff operator does not depend on batch statistics.",
  "title": "ResMLP: Feedforward networks for image classification with data-efficient training",
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "Shape Adaptor",
  "full_name": "Shape Adaptor",
  "description": "**Shape Adaptor** is a novel resizing module for neural networks. It is a drop-in enhancement built on top of traditional resizing layers, such as pooling, bilinear sampling, and strided [convolution](https://paperswithcode.com/method/convolution). This module allows for a learnable shaping factor which differs from the traditional resizing layers that are fixed and deterministic.\r\n\r\nImage Source: [Liu et al.](https://arxiv.org/pdf/2008.00892v2.pdf)",
  "title": "Shape Adaptor: A Learnable Resizing Module",
  "collection": "AutoML",
  "area": "General"
}
{
  "name": "GPipe",
  "full_name": "GPipe",
  "description": "**GPipe** is a distributed model parallel method for neural networks. With GPipe, each model can be specified as a sequence of layers, and consecutive groups of layers can be partitioned into cells. Each cell is then placed on a separate accelerator. Based on this partitioned setup, batch splitting is applied. A mini-batch of training examples is split into smaller micro-batches, then the execution of each set of micro-batches is pipelined over cells. Synchronous mini-batch gradient descent is applied for training, where gradients are accumulated across all micro-batches in a mini-batch and applied at the end of a mini-batch.",
  "title": "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "AutoAugment",
  "full_name": "AutoAugment",
  "description": "**AutoAugment** is an automated approach to find data augmentation policies from data. It formulates the problem of finding the best augmentation policy as a discrete search problem. It consists of two components: a search algorithm and a search space. \r\n\r\nAt a high level, the search algorithm (implemented as a controller RNN) samples a data augmentation policy $S$, which has information about what image processing operation to use, the probability of using the operation in each batch, and the magnitude of the operation. The policy $S$ is used to train a neural network with a fixed architecture, whose validation accuracy $R$ is sent back to update the controller. Since $R$ is not differentiable, the controller will be updated by policy gradient methods. \r\n\r\nThe operations used are from PIL, a popular Python image library: all functions in PIL that accept an image as input and output an image. It additionally uses two other augmentation techniques: [Cutout](https://paperswithcode.com/method/cutout) and SamplePairing. The operations searched over are ShearX/Y, TranslateX/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, Cutout and Sample Pairing.",
  "title": "AutoAugment: Learning Augmentation Policies from Data",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "CoT Prompting",
  "full_name": "Chain-of-thought prompting",
  "description": "Chain-of-thought prompts contain a series of intermediate reasoning steps, and they are shown to significantly improve the ability of large language models to perform certain tasks that involve complex reasoning (e.g., arithmetic, commonsense reasoning, symbolic reasoning, etc.)",
  "title": "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models",
  "collection": "Prompt Engineering",
  "area": "General"
}
{
  "name": "AdaGrad",
  "full_name": "AdaGrad",
  "description": "**AdaGrad** is a stochastic optimization method that adapts the learning rate to the parameters. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. In its update rule, Adagrad modifies the general learning rate $\\eta$ at each time step $t$ for every parameter $\\theta\\_{i}$ based on the past gradients for $\\theta\\_{i}$: \r\n\r\n$$ \\theta\\_{t+1, i} = \\theta\\_{t, i} - \\frac{\\eta}{\\sqrt{G\\_{t, ii} + \\epsilon}}g\\_{t, i} $$\r\n\r\nThe benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most leave it at a default value of $0.01$. Its main weakness is the accumulation of the squared gradients in the denominator. Since every added term is positive, the accumulated sum keeps growing during training, causing the learning rate to shrink and becoming infinitesimally small.\r\n\r\nImage: [Alec Radford](https://twitter.com/alecrad)",
  "title": null,
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Split Attention",
  "full_name": "Split Attention",
  "description": "A **Split Attention** block enables attention across feature-map groups. As in [ResNeXt blocks](https://paperswithcode.com/method/resnext-block), the feature can be divided into several groups, and the number of feature-map groups is given by a cardinality hyperparameter $K$. The resulting feature-map groups are called cardinal groups. Split Attention blocks introduce a new radix hyperparameter $R$ that indicates the number of splits within a cardinal group, so the total number of feature groups is $G = KR$. We may apply a series of transformations {$\\mathcal{F}\\_1, \\mathcal{F}\\_2, \\cdots\\mathcal{F}\\_G$} to each individual group, then the intermediate representation of each group is $U\\_i = \\mathcal{F}\\_i\\left(X\\right)$, for $i \\in$ {$1, 2, \\cdots{G}$}.\r\n\r\nA combined representation for each cardinal group can be obtained by fusing via an element-wise summation across multiple splits. The representation for $k$-th cardinal group is \r\n$\\hat{U}^k = \\sum_{j=R(k-1)+1}^{R k} U_j $, where $\\hat{U}^k \\in \\mathbb{R}^{H\\times W\\times C/K}$ for $k\\in{1,2,...K}$, and $H$, $W$ and $C$ are the block output feature-map sizes. \r\nGlobal contextual information with embedded channel-wise statistics can be gathered with [global average pooling](https://paperswithcode.com/method/global-average-pooling) across spatial dimensions  $s^k\\in\\mathbb{R}^{C/K}$. Here the $c$-th component is calculated as:\r\n\r\n$$\r\n    s^k\\_c = \\frac{1}{H\\times W} \\sum\\_{i=1}^H\\sum\\_{j=1}^W \\hat{U}^k\\_c(i, j).\r\n$$\r\n\r\nA weighted fusion of the cardinal group representation $V^k\\in\\mathbb{R}^{H\\times W\\times C/K}$ is aggregated using [channel-wise soft attention](https://paperswithcode.com/method/channel-wise-soft-attention), where each feature-map channel is produced using a weighted combination over splits. 
The $c$-th channel is calculated as:\r\n\r\n$$\r\n    V^k_c=\\sum_{i=1}^R a^k_i(c) U_{R(k-1)+i} ,\r\n$$\r\n\r\nwhere $a_i^k(c)$ denotes a (soft) assignment weight given by:\r\n\r\n$$\r\na_i^k(c) =\r\n\\begin{cases}\r\n  \\frac{\\exp(\\mathcal{G}^c_i(s^k))}{\\sum_{j=1}^R \\exp(\\mathcal{G}^c_j(s^k))} & \\quad\\textrm{if } R>1, \\\\\r\n   \\frac{1}{1+\\exp(-\\mathcal{G}^c_i(s^k))} & \\quad\\textrm{if } R=1,\\\\\r\n\\end{cases}\r\n$$\r\n\r\nand mapping $\\mathcal{G}_i^c$ determines the weight of each split for the $c$-th channel based on the global context representation $s^k$.",
  "title": "ResNeSt: Split-Attention Networks",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "IoU-guided NMS",
  "full_name": "IoU-guided NMS",
  "description": "**IoU-guided NMS** is a type of non-maximum suppression that help to eliminate the suppression failure caused by the misleading classification confidences. This is achieved through using the predicted IoU instead of the classification confidence as the ranking keyword for bounding boxes. ",
  "title": "Acquisition of Localization Confidence for Accurate Object Detection",
  "collection": "Proposal Filtering",
  "area": "Computer Vision"
}
{
  "name": "SIFA",
  "full_name": "Synergistic Image and Feature Alignment",
  "description": "**Synergistic Image and Feature Alignment** is an unsupervised domain adaptation framework that conducts synergistic alignment of domains from both image and feature perspectives. In SIFA, we simultaneously transform the appearance of images across domains and enhance domain-invariance of the extracted features by leveraging adversarial learning in multiple aspects and with a deeply supervised mechanism. The feature encoder is shared between both adaptive perspectives to leverage their mutual benefits via end-to-end learning.",
  "title": "Unsupervised Bidirectional Cross-Modality Adaptation via Deeply Synergistic Image and Feature Alignment for Medical Image Segmentation",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "Label Quality Model",
  "full_name": "Label Quality Model",
  "description": "**Label Quality Model** is an intermediate supervised task aimed at predicting the clean labels from noisy labels by leveraging rater features and a paired subset for supervision. The LQM technique assumes the existence of rater features and a subset of training data with both noisy and clean labels, which we call paired-subset. In real world scenarios, some level of label noise may be unavoidable. The LQM approach still works as long as the clean(er) label is less noisy than a label from a rater that is randomly selected from the pool, e.g., clean labels can be from either expert raters or aggregation of multiple raters. LQM is trained on the paired-subset using rater features and noisy label as input, and inferred on the entire training corpus. The output of LQM is used during model training as a more accurate alternative to the noisy labels.",
  "title": "An Instance-Dependent Simulation Framework for Learning with Label Noise",
  "collection": "Label Correction",
  "area": "General"
}
{
  "name": "DAFNe",
  "full_name": "DAFNe",
  "description": "**DAFNe** is a dense one-stage anchor-free deep model for oriented object detection. It is a deep neural network that performs predictions on a dense grid over the input image, being architecturally simpler in design, as well as easier to optimize than its two-stage counterparts. Furthermore, it reduces the prediction complexity by refraining from employing bounding box anchors. This enables a tighter fit to oriented objects, leading to a better separation of bounding boxes especially in case of dense object distributions. Moreover, it introduces an orientation-aware generalization of the center-ness function to arbitrary quadrilaterals that takes into account the object's orientation and that, accordingly, accurately down-weights low-quality predictions",
  "title": "DAFNe: A One-Stage Anchor-Free Approach for Oriented Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Gradient Checkpointing",
  "full_name": "Gradient Checkpointing",
  "description": "**Gradient Checkpointing** is a method used for reducing the memory footprint when training deep neural networks, at the cost of having a small increase in computation time.",
  "title": "Training Deep Nets with Sublinear Memory Cost",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "SRGAN Residual Block",
  "full_name": "SRGAN Residual Block",
  "description": "**SRGAN Residual Block** is a residual block used in the [SRGAN](https://paperswithcode.com/method/srgan) generator for image super-resolution. It is similar to standard [residual blocks](https://paperswithcode.com/method/residual-block), although it uses a [PReLU](https://paperswithcode.com/method/prelu) activation function to help training (preventing sparse gradients during [GAN](https://paperswithcode.com/method/gan) training).",
  "title": "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "REINFORCE",
  "full_name": "REINFORCE",
  "description": "**REINFORCE** is a Monte Carlo variant of a policy gradient algorithm in reinforcement learning. The agent collects samples of an episode using its current policy, and uses it to update the policy parameter $\\theta$. Since one full trajectory must be completed to construct a sample space, it is updated as an off-policy algorithm.\r\n\r\n$$ \\nabla\\_{\\theta}J\\left(\\theta\\right) = \\mathbb{E}\\_{\\pi}\\left[G\\_{t}\\nabla\\_{\\theta}\\ln\\pi\\_{\\theta}\\left(A\\_{t}\\mid{S\\_{t}}\\right)\\right]$$\r\n\r\nImage Credit: [Tingwu Wang](http://www.cs.toronto.edu/~tingwuwang/REINFORCE.pdf)",
  "title": null,
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "Monte Carlo Dropout",
  "full_name": "Monte Carlo Dropout",
  "description": "",
  "title": "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "GrowNet",
  "full_name": "GrowNet",
  "description": "**GrowNet** is a novel approach to combine the power of gradient boosting to incrementally build complex deep neural networks out of shallow components. It introduces a versatile framework that can readily be adapted for a diverse range of machine learning tasks in a wide variety of domains.",
  "title": "Gradient Boosting Neural Networks: GrowNet",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "ReLUN",
  "full_name": "Rectified Linear Unit N",
  "description": "The **Rectified Linear Unit N**, or **ReLUN**, is a modification of **[ReLU6](https://paperswithcode.com/method/relu6)** activation function that has trainable parameter **n**.\r\n\r\n$$ReLUN(x) = min(max(0, x), n)$$",
  "title": "Trainable Activations for Image Classification",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "ComplEx-N3",
  "full_name": "ComplEx with N3 Regularizer",
  "description": "ComplEx model trained with a nuclear norm regularizer",
  "title": "Canonical Tensor Decomposition for Knowledge Base Completion",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "HiFi-GAN",
  "full_name": "HiFi-GAN",
  "description": "**HiFi-GAN** is a generative adversarial network for speech synthesis. HiFi-GAN consists of one generator and two discriminators: multi-scale and multi-period discriminators. The generator and discriminators are trained adversarially, along with two additional losses for improving training stability and model performance.\r\n\r\nThe generator is a fully convolutional neural network. It uses a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of raw waveforms. Every [transposed convolution](https://paperswithcode.com/method/transposed-convolution) is followed by a multi-receptive field fusion (MRF) module.\r\n\r\nFor the discriminator, a multi-period discriminator (MPD) is used consisting of several sub-discriminators each handling a portion of periodic signals of input audio. Additionally, to capture consecutive patterns and long-term dependencies, the multi-scale discriminator (MSD) proposed in [MelGAN](https://paperswithcode.com/method/melgan) is used, which consecutively evaluates audio samples at different levels.",
  "title": "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "ShuffleNet V2 Downsampling Block",
  "full_name": "ShuffleNet V2 Downsampling Block",
  "description": "**ShuffleNet V2 Downsampling Block** is a block for spatial downsampling used in the [ShuffleNet V2](https://paperswithcode.com/method/shufflenet-v2) architecture. Unlike the regular [ShuffleNet](https://paperswithcode.com/method/shufflenet) V2 block, the channel split operator is removed so the number of output channels is doubled.",
  "title": "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "GPT-Neo",
  "full_name": "GPT-Neo",
  "description": "An implementation of model & data parallel [GPT3-like](https://paperswithcode.com/method/gpt-3) models using the [mesh-tensorflow](https://github.com/tensorflow/mesh) library.\r\n\r\nSource: [EleutherAI/GPT-Neo](https://github.com/EleutherAI/gpt-neo)",
  "title": null,
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "VL-BERT",
  "full_name": "Visual-Linguistic BERT",
  "description": "VL-BERT is pre-trained on a large-scale image-captions dataset together with text-only corpus. The input to the model are either words from the input sentences or regions-of-interest (RoI) from input images. It can be fine-tuned to fit most visual-linguistic downstream tasks. Its backbone is a multi-layer bidirectional Transformer encoder, modified to accommodate visual contents, and new type of visual feature embedding to the input feature embeddings. VL-BERT takes both visual and linguistic elements as input, represented as RoIs in images and subwords in input sentences. Four different types of embeddings are used to represent each input: token embedding, visual feature embedding, segment embedding, and sequence position embedding. VL-BERT is pre-trained using Conceptual Captions and text-only datasets. Two pre-training tasks are used: masked language modeling with visual clues, and masked RoI classification with linguistic clues.",
  "title": "VL-BERT: Pre-training of Generic Visual-Linguistic Representations",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "MBS",
  "full_name": "Model-based Subsampling",
  "description": "To avoid the problem caused by low-frequent entity-relation pairs, our MBS uses the estimated probabilities from a trained model $\\mathbf{\\theta}'$ to calculate frequencies for each triplet and query. By using $\\mathbf{\\theta}'$, the NS loss in KGE with MBS is represented as follows:\r\n\\begin{align}\r\n    &\\ell_{mbs}(\\mathbf{\\theta};\\mathbf{\\theta}') \\nonumber \\\\\r\n=&-\\frac{1}{|D|}\\sum_{(x,y) \\in D} \\Bigl[A_{mbs}(\\mathbf{\\theta}')\\log(\\sigma(s_{\\mathbf{\\theta}}(x,y)+\\gamma))\\nonumber\\\\\r\n    &+\\frac{1}{\\nu}sum_{y_{i}\\sim p_n(y_{i}|x)}^{\\nu}B_{mbs}(\\mathbf{\\theta}')\\log(\\sigma(-s_{\\mathbf{\\theta}}(x,y_i)-\\gamma))\\Bigr],\r\n\\end{align}",
  "title": "Model-based Subsampling for Knowledge Graph Completion",
  "collection": "Negative Sampling",
  "area": "General"
}
{
  "name": "Hard Swish",
  "full_name": "Hard Swish",
  "description": "**Hard Swish** is a type of activation function based on [Swish](https://paperswithcode.com/method/swish), but replaces the computationally expensive sigmoid with a piecewise linear analogue:\r\n\r\n$$\\text{h-swish}\\left(x\\right) = x\\frac{\\text{ReLU6}\\left(x+3\\right)}{6} $$",
  "title": "Searching for MobileNetV3",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Feedforward Network",
  "full_name": "Feedforward Network",
  "description": "A **Feedforward Network**, or a **Multilayer Perceptron (MLP)**, is a neural network with solely densely connected layers. This is the classic neural network architecture of the literature. It consists of inputs $x$ passed through units $h$ (of which there can be many layers) to predict a target $y$. Activation functions are generally chosen to be non-linear to allow for flexible functional approximation.\r\n\r\nImage Source: Deep Learning, Goodfellow et al",
  "title": null,
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "MHMA",
  "full_name": "Multi-Heads of Mixed Attention",
  "description": "The multi-head of mixed attention combines both self- and cross-attentions, encouraging high-level learning of interactions between entities captured in the various attention features. It is build with several attention heads, each of the head can implement either self or cross attention. A self attention is when the key and query features are the same or come from the same domain features. A cross attention is when the key and query features are generated from different features. Modeling MHMA allows a model to identity the relationship between features of different domains. This is very useful in tasks involving relationship modeling such as human-object interaction, tool-tissue interaction, man-machine interaction, human-computer interface, etc.",
  "title": "Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "IterInpaint",
  "full_name": "Iterative Inpainting",
  "description": "",
  "title": "Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation",
  "collection": "Image Generation Models",
  "area": "Computer Vision"
}
{
  "name": "3D Convolution",
  "full_name": "3D Convolution",
  "description": "A **3D Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) where the kernel slides in 3 dimensions as opposed to 2 dimensions with 2D convolutions. One example use case is medical imaging where a model is constructed using 3D image slices. Additionally video based data has an additional temporal dimension over images making it suitable for this module. \r\n\r\nImage: Lung nodule detection based on 3D convolutional neural networks, Fan et al",
  "title": null,
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "LAMB",
  "full_name": "LAMB",
  "description": "**LAMB** is a a layerwise adaptive large batch optimization technique. It provides a strategy for adapting the learning rate in large batch settings. LAMB uses [Adam](https://paperswithcode.com/method/adam) as the base algorithm and then forms an update as:\r\n\r\n$$r\\_{t} = \\frac{m\\_{t}}{\\sqrt{v\\_{t}} + \\epsilon}$$\r\n$$x\\_{t+1}^{\\left(i\\right)} = x\\_{t}^{\\left(i\\right)}  - \\eta\\_{t}\\frac{\\phi\\left(|| x\\_{t}^{\\left(i\\right)} ||\\right)}{|| m\\_{t}^{\\left(i\\right)} || }\\left(r\\_{t}^{\\left(i\\right)}+\\lambda{x\\_{t}^{\\left(i\\right)}}\\right) $$\r\n\r\nUnlike [LARS](https://paperswithcode.com/method/lars), the adaptivity of LAMB is two-fold: (i) per dimension normalization with respect to the square root of the second moment used in Adam and (ii) layerwise normalization obtained due to layerwise adaptivity.",
  "title": "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes",
  "collection": "Large Batch Optimization",
  "area": "General"
}
{
  "name": "DualGCN",
  "full_name": "Dual Graph Convolutional Networks",
  "description": "A dual graph convolutional neural network jointly considers the two essential assumptions of semi-supervised learning: (1) local consistency and (2) global consistency. Accordingly, two convolutional neural networks are devised to embed the local-consistency-based and global-consistency-based knowledge, respectively.\r\n\r\nDescription  and image from: [Dual Graph Convolutional Networks for Graph-Based Semi-Supervised Classification](https://persagen.com/files/misc/zhuang2018dual.pdf)",
  "title": null,
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "G-GLN",
  "full_name": "Gaussian Gated Linear Network",
  "description": "**Gaussian Gated Linear Network**, or **G-GLN**, is a multi-variate extension to the recently proposed [GLN](https://paperswithcode.com/method/gln) family of deep neural networks by reformulating the GLN neuron as a gated product of Gaussians. This Gaussian Gated Linear Network (G-GLN) formulation exploits the fact that exponential family densities are closed under multiplication, a property that has seen much use in [Gaussian Process](https://paperswithcode.com/method/gaussian-process) and related literature. Similar to the Bernoulli GLN, every neuron in the G-GLN directly predicts the target distribution.  \r\n\r\nPrecisely, a G-GLN is a feed-forward network of data-dependent distributions. Each neuron calculates the sufficient statistics $\\left(\\mu, \\sigma\\_{2}\\right)$ for its associated PDF using its active weights, given those emitted by neurons in the preceding layer. It consists of consists of $L+1$ layers indexed by $i \\in\\{0, \\ldots, L\\}$ with $K\\_{i}$ neurons in each layer. The weight space for a neuron in layer $i$ is denoted by $\\mathcal{W}\\_{i}$; the subscript is needed since the dimension of the weight space depends on $K_{i-1}$. Each neuron/distribution is indexed by its position in the network when laid out on a grid; for example, $f\\_{i k}$ refers to the family of PDFs defined by the $k$ th neuron in the $i$ th layer. Similarly, $c\\_{i k}$ refers to the context function associated with each neuron in layers $i \\geq 1$, and $\\mu\\_{i k}$ and $\\sigma\\_{i k}^{2}$ (or $\\Sigma\\_{i k}$ in the multivariate case) referring to the sufficient statistics for each Gaussian PDF.\r\n\r\nThere are two types of input to neurons in the network. The first is the side information, which can be thought of as the input features, and is used to determine the weights used by each neuron via half-space gating. 
The second is the input to the neuron, which is the PDFs output by the previous layer, or in the case of layer 0, some provided base models. To apply a G-GLN in a supervised learning setting, we need to map the sequence of input-label pairs $\\left(x\\_{t}, y\\_{t}\\right)$ for $t=1,2, \\ldots$ onto a sequence of (side information, base Gaussian PDFs, label) triplets $\\left(z\\_{t},\\left\\{f\\_{0 i}\\right\\}\\_{i}, y\\_{t}\\right)$. The side information $z\\_{t}$ is set to the (potentially normalized) input features $x\\_{t}$. The Gaussian PDFs for layer 0 will generally include the necessary base Gaussian PDFs to span the target range, and optionally some base prediction PDFs that capture domain-specific knowledge.",
  "title": "Gaussian Gated Linear Networks",
  "collection": "Gated Linear Networks",
  "area": "General"
}
{
  "name": "UCTransNet",
  "full_name": "UCTransNet",
  "description": "**UCTransNet** is an end-to-end deep learning network for semantic segmentation that takes [U-Net](https://paperswithcode.com/method/u-net) as the main structure of the network. The original skip connections of U-Net are replaced by CTrans consisting of two components: [Channel-wise Cross fusion Transformer](https://paperswithcode.com/method/channel-wise-cross-fusion-transformer) ([CCT](https://paperswithcode.com/method/cct)) and [Channel-wise Cross Attention](https://paperswithcode.com/method/channel-wise-cross-attention) (CCA) to guide the fused multi-Scale channel-wise information to effectively connect to the decoder features for eliminating the ambiguity.",
  "title": "UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "SDNE",
  "full_name": "Structural Deep Network Embedding",
  "description": "",
  "title": "Structural Deep Network Embedding",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "Powerpropagation",
  "full_name": "Powerpropagation",
  "description": "**Powerpropagation** is a weight-parameterisation for neural networks that leads to inherently sparse models. Exploiting the behaviour of gradient descent, it gives rise to weight updates exhibiting a “rich get richer” dynamic, leaving low-magnitude parameters largely unaffected by learning.In other words, parameters with larger magnitudes are allowed to adapt faster in order to represent the required features to solve the task, while smaller magnitude parameters are restricted, making it more likely that they will be irrelevant in representing the learned solution.  Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.",
  "title": "Powerpropagation: A sparsity inducing weight reparameterisation",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Triplet Attention",
  "full_name": "Triplet Attention",
  "description": "Triplet attention comprises of three branches each responsible for capturing crossdimension between the spatial dimensions and channel dimension of the input. Given an input tensor with shape (C × H × W), each branch is responsible for aggregating cross-dimensional interactive features between either the spatial dimension H or W and the channel dimension C.",
  "title": "Rotate to Attend: Convolutional Triplet Attention Module",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "SiLU",
  "full_name": "Sigmoid Linear Unit",
  "description": "** Sigmoid Linear Units**, or **SiLUs**, are activation functions for\r\nneural networks. The activation of the SiLU is computed by the sigmoid function multiplied by its input, or $$ x\\sigma(x).$$\r\n\r\nSee [Gaussian Error Linear Units](https://arxiv.org/abs/1606.08415) ([GELUs](https://paperswithcode.com/method/gelu)) where the SiLU was originally coined, and see [Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning](https://arxiv.org/abs/1702.03118) and [Swish: a Self-Gated Activation Function](https://arxiv.org/abs/1710.05941v1) where the SiLU was experimented with later.",
  "title": "Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Local Mixup",
  "full_name": "Local Mixup",
  "description": "",
  "title": "Preventing Manifold Intrusion with Locality: Local Mixup",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "FreeAnchor",
  "full_name": "FreeAnchor",
  "description": "**FreeAnchor** is an anchor supervision method for object detection. Many CNN-based object detectors assign anchors for ground-truth objects under the restriction of object-anchor Intersection-over-Unit (IoU). In contrast, FreeAnchor is a learning-to-match approach that breaks the IoU restriction, allowing objects to match anchors in a flexible manner. It updates hand-crafted anchor assignment to free anchor matching by formulating detector training as a maximum likelihood estimation (MLE) procedure. FreeAnchor targets at learning features which best explain a class of objects in terms of both classification and localization.",
  "title": "FreeAnchor: Learning to Match Anchors for Visual Object Detection",
  "collection": "Anchor Supervision",
  "area": "Computer Vision"
}
{
  "name": "DAU-ConvNet",
  "full_name": "Displaced Aggregation Units",
  "description": "**Displaced Aggregation Unit** replaces classic [convolution](https://paperswithcode.com/method/convolution) layer in ConvNets with learnable positions of units.  This introduces explicit structure of hierarchical compositions and results in several benefits:\r\n\r\n* fully adjustable and **learnable receptive fields** through spatially-adjustable filter units\r\n* **reduced parameters** for spatial coverage\r\nefficient inference\r\n* **decupling** of the parameters from the receptive field sizes\r\n\r\nMore information can be found [here.](https://www.vicos.si/Research/DeepCompositionalNet)",
  "title": "Spatially-Adaptive Filter Units for Compact and Efficient Deep Neural Networks",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Dynamic Convolution",
  "full_name": "Dynamic Convolution",
  "description": "The extremely low computational cost of lightweight CNNs constrains the depth and width of the networks, further decreasing their representational power. To address the above problem, Chen et al. proposed dynamic convolution, a novel operator design that increases  representational power with negligible additional computational cost and does not change the width or depth of the network in parallel with CondConv.\r\n\r\nDynamic convolution uses $K$ parallel convolution kernels of the same  size and input/output dimensions instead of one kernel per layer. Like SE blocks, it adopts a squeeze-and-excitation mechanism to generate the attention weights for the different convolution kernels. These kernels are then aggregated dynamically by weighted summation and applied to the input feature map $X$:\r\n\\begin{align}\r\n    s & = \\text{softmax} (W_{2} \\delta (W_{1}\\text{GAP}(X)))\r\n\\end{align}\r\n\\begin{align}\r\n    \\text{DyConv} &= \\sum_{i=1}^{K} s_k \\text{Conv}_k \r\n\\end{align}\r\n\\begin{align}\r\n    Y &= \\text{DyConv}(X)\r\n\\end{align}\r\nHere the convolutions are combined by summation of weights and biases of convolutional kernels. \r\n\r\nCompared to applying convolution to the feature map, the computational cost of squeeze-and-excitation and weighted summation is extremely low. Dynamic convolution thus provides an efficient operation to improve  representational power and can be easily used as a replacement for any convolution.",
  "title": "Dynamic Convolution: Attention over Convolution Kernels",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ChebNet",
  "full_name": "ChebNet",
  "description": "ChebNet involves a formulation of CNNs in the context of spectral graph theory, which provides the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs.\r\n\r\nDescription from: [Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering](https://arxiv.org/pdf/1606.09375.pdf)",
  "title": "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Nyströmformer",
  "full_name": "Nyströmformer",
  "description": "Nyströmformer replaces the self-attention in [BERT](https://paperswithcode.com/method/bert)-small and BERT-base using the proposed Nyström approximation. This reduces self-attention complexity to $O(n)$ and allows the [Transformer](https://paperswithcode.com/method/transformer) to support longer sequences.",
  "title": "Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Fast-YOLOv2",
  "full_name": "Fast-YOLOv2",
  "description": "",
  "title": "YOLO9000: Better, Faster, Stronger",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Truncation Trick",
  "full_name": "Truncation Trick",
  "description": "The **Truncation Trick** is a latent sampling procedure for generative adversarial networks, where we sample $z$ from a truncated normal (where values which fall outside a range are resampled to fall inside that range). \r\nThe original implementation was in [Megapixel Size Image Creation with GAN](https://paperswithcode.com/paper/megapixel-size-image-creation-using).\r\nIn [BigGAN](http://paperswithcode.com/method/biggan), the authors find this provides a boost to the Inception Score and FID.",
  "title": "Megapixel Size Image Creation using Generative Adversarial Networks",
  "collection": "Latent Variable Sampling",
  "area": "General"
}
{
  "name": "FM with splines",
  "full_name": "Factorization machines with cubic splines for numerical features",
  "description": "Using cubic splines to improve factorization machine accuracy with numerical features",
  "title": "Basis Function Encoding of Numerical Features in Factorization Machines for Improved Accuracy",
  "collection": "Factorization Machines",
  "area": "General"
}
{
  "name": "TSRUc",
  "full_name": "TSRUc",
  "description": "**TSRUc**, or **Transformation-based Spatial Recurrent Unit c**, is a modification of a [ConvGRU](https://paperswithcode.com/method/cgru) used in the [TriVD-GAN](https://paperswithcode.com/method/trivd-gan) architecture for video generation.\r\n\r\nInstead of computing the reset gate $r$ and resetting $h\\_{t−1}$, the TSRUc computes the parameters of a transformation $\\theta$, which we use to warp $h\\_{t−1}$. The rest of our model is unchanged (with $\\hat{h}\\_{t-1}$ playing the role of $h'\\_{t}$ in $c$’s update equation from ConvGRU. The TSRUc module is described by the following equations:\r\n\r\n$$ \\theta\\_{h,x} = f\\left(h\\_{t−1}, x\\_{t}\\right) $$\r\n\r\n$$ \\hat{h}\\_{t-1} = w\\left(h\\_{t-1}; \\theta\\_{h, x}\\right) $$\r\n\r\n$$ c = \\rho\\left(W\\_{c} \\star\\_{n}\\left[\\hat{h}\\_{t-1};x\\_{t}\\right] + b\\_{c} \\right) $$\r\n\r\n$$ u = \\sigma\\left(W\\_{u} \\star\\_{n}\\left[h\\_{t-1};x\\_{t}\\right] + b\\_{u} \\right) $$\r\n\r\n$$ h\\_{t} = u \\odot h\\_{t-1} + \\left(1-u\\right) \\odot c $$\r\n\r\nIn these equations $\\sigma$ and $\\rho$ are the elementwise sigmoid and [ReLU](https://paperswithcode.com/method/relu) functions respectively and the $\\star\\_{n}$ represents a [convolution](https://paperswithcode.com/method/convolution) with a kernel of size $n \\times n$. Brackets are used to represent a feature concatenation.",
  "title": "Transformation-based Adversarial Video Prediction on Large-Scale Data",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Cyclical Learning Rate Policy",
  "full_name": "Cyclical Learning Rate Policy",
  "description": "A **Cyclical Learning Rate Policy** combines a linear learning rate decay with warm restarts.\r\n\r\nImage: [ESPNetv2](https://paperswithcode.com/method/espnetv2)",
  "title": null,
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "XCiT Layer",
  "full_name": "XCiT Layer",
  "description": "An **XCiT Layer** is the main building block of the [XCiT](https://paperswithcode.com/method/xcit) architecture which uses a [cross-covariance attention]() operator as its principal operation. The XCiT layer consists of three main blocks, each preceded by [LayerNorm](https://paperswithcode.com/method/layer-normalization) and followed by a [residual connection](https://paperswithcode.com/method/residual-connection): (i) the core [cross-covariance attention](https://paperswithcode.com/method/cross-covariance-attention) (XCA) operation, (ii) the [local patch interaction](https://paperswithcode.com/method/local-patch-interaction) (LPI) module, and (iii) a [feed-forward network](https://paperswithcode.com/method/feedforward-network) (FFN). By transposing the query-key interaction, the computational complexity of XCA is linear in the number of data elements N, rather than quadratic as in conventional self-attention.",
  "title": "XCiT: Cross-Covariance Image Transformers",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Adaptive Dropout",
  "full_name": "Adaptive Dropout",
  "description": "**Adaptive Dropout** is a regularization technique that extends dropout by allowing the dropout probability to be different for different units. The intuition is that there may be hidden units that can individually make confident predictions for the presence or absence of an important feature or combination of features. [Dropout](https://paperswithcode.com/method/dropout) will ignore this confidence and drop the unit out 50% of the time. \r\n\r\nDenote the activity of unit $j$ in a deep neural network by $a\\_{j}$ and assume that its inputs are {$a\\_{i}: i < j$}. In dropout, $a\\_{j}$ is randomly set to zero with probability 0.5. Let $m\\_{j}$ be a binary variable that is used to mask, the activity $a\\_{j}$, so that its value is:\r\n\r\n$$ a\\_{j} = m\\_{j}g \\left( \\sum\\_{i: i<j}w\\_{j, i}a\\_{i} \\right)$$\r\n\r\nwhere $w\\_{j,i}$ is the weight from unit $i$ to unit $j$ and $g\\left(·\\right)$ is the activation function and $a\\_{0} = 1$ accounts for biases. Whereas in standard dropout, $m\\_{j}$ is Bernoulli with probability $0.5$, adaptive dropout uses adaptive dropout probabilities that depends on input activities:\r\n\r\n$$ P\\left(m\\_{j} = 1\\mid{\\{a\\_{i}: i < j\\}}\\right) = f \\left( \\sum\\_{i: i<j}\\pi{\\_{j, i}a\\_{i}} \\right) $$\r\n\r\nwhere $\\pi\\_{j, i}$ is the weight from unit $i$ to unit $j$ in the standout network or the adaptive dropout network; $f(·)$ is a sigmoidal function. Here 'standout' refers to a binary belief network is that is overlaid on a neural network as part of the overall regularization technique.",
  "title": "Adaptive dropout for training deep neural networks",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "ACTKR",
  "full_name": "ACTKR",
  "description": "**ACKTR**, or **Actor Critic with Kronecker-factored Trust Region**, is an actor-critic method for reinforcement learning that applies [trust region optimization](https://paperswithcode.com/method/trpo) using a recently proposed Kronecker-factored approximation to the curvature. The method extends the framework of natural policy gradient and optimizes both the actor and the critic using Kronecker-factored approximate\r\ncurvature (K-FAC) with trust region.",
  "title": "Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "Exact Fusion Model",
  "full_name": "Exact Fusion Model",
  "description": "**Exact Fusion Model (EFM)** is a method for aggregating a feature pyramid. The EFM is based on [YOLOv3](https://paperswithcode.com/method/yolov3), which assigns exactly one bounding-box prior to each ground truth object. Each ground truth bounding box corresponds to one anchor box that surpasses the threshold IoU. If the size of an anchor box is equivalent to the field-of-view of the grid cell, then for the grid cells of the $s$-th scale, the corresponding bounding box will be lower bounded by the $(s − 1)$th scale and upper bounded by the (s + 1)th scale. Therefore, the EFM assembles features from the three scales.",
  "title": "CSPNet: A New Backbone that can Enhance Learning Capability of CNN",
  "collection": "Feature Pyramid Blocks",
  "area": "Computer Vision"
}
{
  "name": "LapStyle",
  "full_name": "Laplacian Pyramid Network",
  "description": "**LapStyle**, or **Laplacian Pyramid Network**, is a feed-forward style transfer method. It uses a [Drafting Network](https://paperswithcode.com/method/drafting-network) to transfer global style patterns in low-resolution, and adopts higher resolution [Revision Networks](https://paperswithcode.com/method/revision-network) to revise local styles in a pyramid manner according to outputs of multi-level Laplacian filtering of the content image. Higher resolution details can be generated by stacking Revision Networks with multiple Laplacian pyramid levels. The final stylized image is obtained by aggregating outputs of all pyramid levels.\r\n\r\nSpecifically, we first generate image pyramid $\\left\\(\\bar{x}\\_{c}, r\\_{c}\\right\\)$ from content image $x\\_{c}$ with the help of Laplacian filter. Rough low-resolution stylized image are then generated by the Drafting Network. Then the Revision Network generates stylized detail image in high resolution. Then the final stylized image is generated by aggregating the outputs pyramid. $L, C$ and $A$ in an image represent Laplacian, concatenate and aggregation operation separately.",
  "title": "Drafting and Revision: Laplacian Pyramid Network for Fast High-Quality Artistic Style Transfer",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "CornerNet-Squeeze",
  "full_name": "CornerNet-Squeeze",
  "description": "**CornerNet-Squeeze** is an object detector that extends [CornerNet](https://paperswithcode.com/method/cornernet) with a new compact hourglass architecture that makes use of fire modules with depthwise separable convolutions.",
  "title": "CornerNet-Lite: Efficient Keypoint Based Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "NFN",
  "full_name": "Neo-fuzzy-neuron",
  "description": "**Neo-fuzzy-neuron** is a type of artificial neural network that combines the characteristics of both fuzzy logic and neural networks. It uses a fuzzy inference system to model non-linear relationships between inputs and outputs, and a feedforward neural network to learn the parameters of the fuzzy system. The combination of these two approaches provides a flexible and powerful tool for solving a wide range of problems in areas such as pattern recognition, control, and prediction.",
  "title": "Neo-fuzzy-neuron based new approach to system modeling, with application to actual system",
  "collection": "Adaptive Activation Functions",
  "area": "General"
}
{
  "name": "Flan-T5",
  "full_name": "Flan-T5",
  "description": "**Flan-T5** is the instruction fine-tuned version of **T5** or **Text-to-Text Transfer Transformer** Language Model.",
  "title": "Scaling Instruction-Finetuned Language Models",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "BLIP",
  "full_name": "BLIP: Bootstrapping Language-Image Pre-training",
  "description": "Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.",
  "title": "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Restricted Boltzmann Machine",
  "full_name": "Restricted Boltzmann Machine",
  "description": "**Restricted Boltzmann Machines**, or **RBMs**, are two-layer generative neural networks that learn a probability distribution over the inputs. They are a special class of Boltzmann Machine in that they have a restricted number of connections between visible and hidden units. Every node in the visible layer is connected to every node in the hidden layer, but no nodes in the same group are connected. RBMs are usually trained using the contrastive divergence learning procedure.\r\n\r\nImage Source: [here](https://medium.com/datatype/restricted-boltzmann-machine-a-complete-analysis-part-1-introduction-model-formulation-1a4404873b3)",
  "title": null,
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "DistanceNet",
  "full_name": "DistanceNet",
  "description": "**DistanceNet** is a learning algorithm for multi-source domain adaptation that uses various distance measures, or a mixture of these distance measures, as an additional loss function to be minimized jointly with the task's loss function, so as to achieve better unsupervised domain adaptation.",
  "title": "Multi-Source Domain Adaptation for Text Classification via DistanceNet-Bandits",
  "collection": "Domain Adaptation",
  "area": "General"
}
{
  "name": "HyperDenseNet",
  "full_name": "HyperDenseNet",
  "description": "Recently, [dense connections](https://paperswithcode.com/method/dense-connections) have attracted substantial attention in computer vision because they facilitate gradient flow and implicit deep supervision during training. Particularly, [DenseNet](https://paperswithcode.com/method/densenet) that connects each layer to every other layer in a feed-forward fashion and has shown impressive performances in natural image classification tasks. We propose HyperDenseNet, a 3-D fully convolutional neural network that extends the definition of dense connectivity to multi-modal segmentation problems. Each imaging modality has a path, and dense connections occur not only between the pairs of layers within the same path but also between those across different paths. This contrasts with the existing multi-modal CNN approaches, in which modeling several modalities relies entirely on a single joint layer (or level of abstraction) for fusion, typically either at the input or at the output of the network. Therefore, the proposed network has total freedom to learn more complex combinations between the modalities, within and in-between all the levels of abstraction, which increases significantly the learning representation. We report extensive evaluations over two different and highly competitive multi-modal brain tissue segmentation challenges, iSEG 2017 and MRBrainS 2013, with the former focusing on six month infant data and the latter on adult images. HyperDenseNet yielded significant improvements over many state-of-the-art segmentation networks, ranking at the top on both benchmarks. We further provide a comprehensive experimental analysis of features re-use, which confirms the importance of hyper-dense connections in multi-modal representation learning.",
  "title": "HyperDense-Net: A hyper-densely connected CNN for multi-modal image segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Non-Linear-Bounding-Function",
  "full_name": "Lower Bound on Transmission Using Non-Linear Bounding Function in Single Image Dehazing",
  "description": "",
  "title": "Lower Bound on Transmission Using Non-Linear Bounding Function in Single Image Dehazing",
  "collection": "Non-Parametric Regression",
  "area": "General"
}
{
  "name": "SimAug",
  "full_name": "Simulation as Augmentation",
  "description": "**SimAug**, or **Simulation as Augmentation**, is a data augmentation method for trajectory prediction. It augments the representation such that it is robust to the variances in semantic scenes and camera views.  First, to deal with the gap between real and synthetic semantic scene, it represents each training trajectory by high-level scene semantic segmentation features, and defends the model from adversarial examples generated by whitebox attack methods. Second, to overcome the changes in camera views, it generates multiple views for the same trajectory, and encourages the model to focus on the “hardest” view to which the model has learned. The classification loss is adopted and the view with the highest loss is favored during training. Finally, the augmented trajectory is computed as a convex combination of the trajectories generated in previous steps. The trajectory prediction model is built on a multi-scale representation and the final model is trained to minimize the empirical vicinal risk over the distribution of augmented trajectories.",
  "title": "SimAug: Learning Robust Representations from 3D Simulation for Pedestrian Trajectory Prediction in Unseen Cameras",
  "collection": "Adversarial Training",
  "area": "General"
}
{
  "name": "1-bit LAMB",
  "full_name": "1-bit LAMB",
  "description": "**1-bit LAMB** is a communication-efficient stochastic optimization technique which introduces a novel way to support adaptive layerwise learning rates even when communication is compressed. Learning from the insights behind [1-bit Adam](https://paperswithcode.com/method/1-bit-adam), it is a a 2-stage algorithm which uses [LAMB](https://paperswithcode.com/method/lamb) (warmup stage) to “pre-condition” a communication compressed momentum SGD algorithm (compression stage). At compression stage where original LAMB algorithm cannot be used to update the layerwise learning rates, 1-bit LAMB employs a novel way to adaptively scale layerwise learning rates based on information from both warmup and compression stages. As a result, 1-bit LAMB is able to achieve large batch optimization (LAMB)’s convergence speed under compressed communication.\r\n\r\nThere are two major differences between 1-bit LAMB and the original LAMB:\r\n\r\n- During compression stage, 1-bit LAMB updates the layerwise learning rate based on a novel “reconstructed gradient” based on the compressed momentum. This makes 1-bit LAMB compatible with error compensation and be able to keep track of the training dynamic under compression.\r\n- 1-bit LAMB also introduces extra stabilized soft thresholds when updating layerwise learning rate at compression stage, which makes training more stable under compression.",
  "title": "1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "PO3D-VQA",
  "full_name": "Parts, Poses, and Occlusions in 3D Visual Question Answering",
  "description": "A VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and a deep neural network with 3D generative representations of objects for robust visual scene parsing.",
  "title": "3D-Aware Visual Question Answering about Parts, Poses and Occlusions",
  "collection": "Multi-Modal Methods",
  "area": "Computer Vision"
}
{
  "name": "SGD with Momentum",
  "full_name": "SGD with Momentum",
  "description": "### Why SGD with  Momentum?\r\nIn deep learning, we have used stochastic gradient descent as one of the optimizers because at the end we will find the minimum weight and bias at which the model loss is lowest. In the SGD we have some issues in which the SGD does not work perfectly because in deep learning we got a non-convex cost function graph and if use the simple SGD then it leads to low performance. There are 3 main reasons why it does not work:\r\n\r\n<img src=\"https://www.cs.umd.edu/~tomg/img/landscapes/shortHighRes.png\" alt=\"Non-convex graph\" style=\"width:400px;height :300px;\" />\r\n\r\n1) We end up in local minima and not able to reach global minima\r\nAt the start, we randomly start at some point and we are going to end up at the local minimum and not able to reach the global minimum.\r\n\r\n2) Saddle Point will be the stop for reaching global minima\r\nA saddle point is a point where in one direction the surface goes in the upward direction and in another direction it goes downwards. So that the slope is changing very gradually so the speed of changing is going to slow and as result, the training also going to slow.\r\n\r\n3) High curvature can be a reason\r\nThe larger radius leads to low curvature and vice-versa. It will be difficult to traverse in the large curvature which was generally high in non-convex optimization.\r\nBy using the SGD with Momentum optimizer we can overcome the problems like high curvature, consistent gradient, and noisy gradient.\r\n\r\n### What is SGD with Momentum?\r\nSGD with  Momentum is one of the optimizers which is used to improve the performance of the neural network.\r\n\r\nLet's take an example and understand the intuition behind the optimizer suppose we have a ball which is sliding from the start of the slope as it goes the speed of the bowl is increased over time. 
Now suppose we are at point A and want to reach point B, but we do not know in which direction to move, so we ask 4 points that have already reached point B. If all 4 points direct us the same way, then A's confidence is higher and it moves in that direction very fast. This is the main concept behind SGD with Momentum.\r\n\r\n<img src=\"https://cdn-images-1.medium.com/max/1000/1*zNbZqU_uDIV13c9ZCJOEXA.jpeg\" alt=\"Non-convex graph\" style=\"width:400px;height :250px;\" />\r\n### How does SGD with Momentum work?\r\nFirst we need the concept of the exponentially weighted moving average (EWMA), a technique for finding the trend in time series data. The formula of the EWMA is:\r\n\r\n<img src=\"https://cdn-images-1.medium.com/max/1000/1*O9Wcq-mbRgNOdRNTivSefw.png\" alt=\"Non-convex graph\" style=\"width:400px;height :100px;\" />\r\n\r\nIn the formula, β is the weight assigned to the past values of the gradient, with 0 < β < 1. If β is 0.5, then 1/(1 - 0.5) = 2, which means the average is effectively computed over the previous 2 readings. \r\n\r\nThe value of Vt depends on β. The higher the value of β, the more past data the average covers, and vice versa. For example, take β = 0.98 and β = 0.9 for two different scenarios: 1/(1 - β) gives 50 and 10 respectively, so the average effectively covers the past 50 and 10 outcomes in the two cases.\r\nSGD with Momentum uses the same concept of EWMA. Here we introduce the term velocity v, which denotes the accumulated change in the gradient used to move towards the global minimum. 
The change in the weights is given by the formula:\r\n\r\n<img src=\"https://cdn-images-1.medium.com/max/1000/0*i_r3u7LACa6dQyXd\" alt=\"Non-convex graph\" style=\"width:400px;height :100px;\" />\r\n\r\nThe β part of the V formula carries the confidence, or in other words the past velocity: to calculate Vt we need Vt-1, to calculate Vt-1 we need Vt-2, and so on. So the history of velocity is used to calculate the momentum, and this is the part that provides acceleration to the formula.\r\n\r\n<img src=\"https://cdn-images-1.medium.com/max/1000/1*L5lNKxAHLPYNc6-Zs4Vscw.png\" alt=\"Non-convex graph\" style=\"width:300px;height :100px;\" />\r\n\r\nHere we have to consider two cases:\r\n1. β=0: as per the formula, the weight update reduces to plain stochastic gradient descent. β is called a decaying factor because it defines how quickly past velocity fades.\r\n\r\n2. β=1: there is no decay. This leads to a dynamic equilibrium, which is not desired, so in practice β is usually set to a value such as 0.9, 0.99, or 0.5.\r\n\r\n### Advantages of SGD with Momentum:\r\n1. Momentum speeds up training compared with plain stochastic gradient descent.\r\n2. Local minima can be escaped and the global minimum reached thanks to the momentum involved.\r\n\r\n<img src=\"https://cdn-images-1.medium.com/max/1000/1*Nb39bHHUWGXqgisr2WcLGQ.gif\" alt=\"Non-convex graph\" style=\"width:400px;height :300px;\" />\r\n\r\nIn the animation, purple is SGD with Momentum and light blue is SGD: SGD with Momentum reaches the global minimum, whereas SGD is stuck in a local minimum.\r\nBut there is a catch: momentum itself can sometimes be a problem. Because of the high momentum, the optimizer keeps fluctuating after reaching the global minimum and takes some time to become stable there. 
That behavior costs time, which makes SGD with Momentum slower than some other optimizers, but still faster than SGD.",
  "title": null,
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "TimeSformer",
  "full_name": "TimeSformer",
  "description": "**TimeSformer** is a [convolution](https://paperswithcode.com/method/convolution)-free approach to video classification built exclusively on self-attention over space and time. It adapts the standard [Transformer](https://paperswithcode.com/method/transformer) architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Specifically, the method adapts the image model [[Vision Transformer](https://paperswithcode.com/method/vision-transformer)](https//www.paperswithcode.com/method/vision-transformer) (ViT) to video by extending the self-attention mechanism from the image space to the space-time 3D volume. As in ViT, each patch is linearly mapped into an embedding and augmented with positional information. This makes it possible to interpret the resulting sequence of vector",
  "title": "Is Space-Time Attention All You Need for Video Understanding?",
  "collection": "Generative Video Models",
  "area": "Computer Vision"
}
{
  "name": "SKNet",
  "full_name": "SKNet",
  "description": "**SKNet** is a type of convolutional neural network that employs [selective kernel](https://paperswithcode.com/method/selective-kernel) units, with selective kernel convolutions, in its architecture. This allows for a type of attention where the network can learn to attend to different receptive fields.",
  "title": null,
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "SOM",
  "full_name": "Self-Organizing Map",
  "description": "The **Self-Organizing Map (SOM)**, commonly also known as Kohonen network (Kohonen 1982, Kohonen 2001) is a computational method for the visualization and analysis of high-dimensional data, especially experimentally acquired information.\r\n\r\nExtracted from [scholarpedia](http://www.scholarpedia.org/article/Self-organizing_map)\r\n\r\n**Sources**:\r\n\r\nImage: [scholarpedia](http://www.scholarpedia.org/article/File:Somnbc.png)\r\n\r\nPaper: [Kohonen, T. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59–69 (1982)](https://doi.org/10.1007/BF00337288)\r\n\r\nBook: [Self-Organizing Maps](https://doi.org/10.1007/978-3-642-56927-2)",
  "title": null,
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "ProxylessNAS",
  "full_name": "ProxylessNAS",
  "description": "**ProxylessNAS** directly learns neural network architectures on the target task and target hardware without any proxy task. Additional contributions include:\r\n\r\n- Using a new path-level pruning perspective for [neural architecture search](https://paperswithcode.com/method/neural-architecture-search), showing a close connection between NAS and model compression. Memory consumption is saved by one order of magnitude by using path-level binarization.\r\n- Using a novel gradient-based approach (latency regularization loss) for handling hardware objectives (e.g. latency). Given different hardware platforms: CPU/GPU/Mobile, ProxylessNAS enables hardware-aware neural network specialization that’s exactly optimized for the target hardware.",
  "title": "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "GMI",
  "full_name": "Graphic Mutual Information",
  "description": "**Graphic Mutual Information**, or **GMI**, measures the correlation between input graphs and high-level hidden representations. GMI generalizes the idea of conventional mutual information computations from vector space to the graph domain where measuring mutual information from two aspects of node features and topological structure is indispensable. GMI exhibits several benefits: First, it is invariant to the isomorphic transformation of input graphs---an inevitable constraint in many existing graph representation learning algorithms; Besides, it can be efficiently estimated and maximized by current mutual information estimation methods such as MINE.",
  "title": "Graph Representation Learning via Graphical Mutual Information Maximization",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "Adaptively Sparse Transformer",
  "full_name": "Adaptively Sparse Transformer",
  "description": "The **Adaptively Sparse Transformer** is a type of [Transformer](https://paperswithcode.com/method/transformer).",
  "title": "Adaptively Sparse Transformers",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Handwritten OCR",
  "full_name": "Handwritten OCR augmentation",
  "description": "We are introducing a universal handwritten image augmentation method that is language-agnostic. This groundbreaking technique can be applied to handwritten images in any language worldwide, marking it as the first of its kind. There are four methods for handwritten images which are ThickOCR, ThinOCR, Elongate OCR, Line Erase OCR.",
  "title": "Handwritten image augmentation",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Adaptive Feature Pooling",
  "full_name": "Adaptive Feature Pooling",
  "description": "**Adaptive Feature Pooling** pools features from all levels for each proposal in object detection and fuses them for the following prediction. For each proposal, we map them to different feature levels. Following the idea of [Mask R-CNN](https://paperswithcode.com/method/adaptive-feature-pooling), [RoIAlign](https://paperswithcode.com/method/roi-align) is used to pool feature grids from each level. Then a fusion operation (element-wise max or sum) is utilized to fuse feature grids from different levels.\r\n\r\nThe motivation for this technique is that in an [FPN](https://paperswithcode.com/method/fpn) we assign proposals to different feature levels based on the size of proposals, which could be suboptimal if images with small differences are assigned to different levels, or if the importance of features is not strongly correlated to their level which they belong.",
  "title": "Path Aggregation Network for Instance Segmentation",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "Attention-augmented Convolution",
  "full_name": "Attention-augmented Convolution",
  "description": "**Attention-augmented Convolution** is a type of [convolution](https://paperswithcode.com/method/convolution) with a two-dimensional relative self-attention mechanism that can replace convolutions as a stand-alone computational primitive for image classification. It employs [scaled-dot product attention](https://paperswithcode.com/method/scaled) and [multi-head attention](https://paperswithcode.com/method/multi-head-attention) as with [Transformers](https://paperswithcode.com/method/transformer).\r\n\r\nIt works by concatenating convolutional and attentional feature map. To see this, consider an original convolution operator with kernel size $k$, $F\\_{in}$ input filters and $F\\_{out}$ output filters. The corresponding attention augmented convolution can be written as\"\r\n\r\n$$\\text{AAConv}\\left(X\\right) = \\text{Concat}\\left[\\text{Conv}(X), \\text{MHA}(X)\\right] $$\r\n\r\n$X$ originates from an input tensor of shape $\\left(H, W, F\\_{in}\\right)$. This is flattened to become $X \\in \\mathbb{R}^{HW \\times F\\_{in}}$ which is passed into a multi-head attention module, as well as a convolution (see above).\r\n\r\nSimilarly to the convolution, the attention augmented convolution 1) is equivariant to translation and 2) can readily operate on inputs of different spatial dimensions.",
  "title": "Attention Augmented Convolutional Networks",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Multi-band MelGAN",
  "full_name": "Multi-band MelGAN",
  "description": "**Multi-band MelGAN**, or **MB-MelGAN**, is a waveform generation model focusing on high-quality text-to-speech. It improves the original [MelGAN](https://paperswithcode.com/method/melgan) in several ways. First, it increases the receptive field of the generator, which is proven to be beneficial to speech generation. Second, it substitutes the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Lastly, [MelGAN](https://paperswithcode.com/method/melgan) is extended with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals which are subsequently summed back to full-band signals as discriminator input.",
  "title": "Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "Mixture Normalization",
  "full_name": "Mixture Normalization",
  "description": "**Mixture Normalization** is normalization technique that relies on an approximation of the probability density function of the internal representations. Any continuous distribution can be approximated with arbitrary precision using a Gaussian Mixture Model (GMM). Hence, instead of computing one set of statistical measures from the entire population (of instances in the mini-batch) as [Batch Normalization](https://paperswithcode.com/method/batch-normalization) does, Mixture Normalization works on sub-populations which can be identified by disentangling modes of the distribution, estimated via GMM. \r\n\r\nWhile BN can only scale and/or shift the whole underlying probability density function, mixture normalization operates like a soft piecewise normalizing transform, capable of completely re-structuring the data distribution by independently scaling and/or shifting individual modes of distribution.",
  "title": "Training Faster by Separating Modes of Variation in Batch-normalized Models",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "VC R-CNN",
  "full_name": "Visual Commonsense Region-based Convolutional Neural Network",
  "description": "**VC R-CNN** is an unsupervised feature representation learning method, which uses Region-based Convolutional Neural Network ([R-CNN](https://paperswithcode.com/method/r-cnn)) as the visual backbone, and the causal intervention as the training objective. Given a set of detected object regions in an image (e.g., using [Faster R-CNN](https://paperswithcode.com/method/faster-r-cnn)), like any other unsupervised feature learning methods (e.g., word2vec), the proxy training objective of VC R-CNN is to predict the contextual objects of a region. However, they are fundamentally different: the prediction of VC R-CNN is by using causal intervention: P(Y|do(X)), while others are by using the conventional likelihood: P(Y|X). This is also the core reason why VC R-CNN can learn \"sense-making\" knowledge like chair can be sat -- while not just \"common\" co-occurrences such as the chair is likely to exist if table is observed.",
  "title": "Visual Commonsense R-CNN",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "PGNet",
  "full_name": "Point Gathering Network",
  "description": "**PGNet** is a point-gathering network for reading arbitrarily-shaped text in real-time. It is a single-shot text spotter, where the pixel-level character classification map is learned with proposed PG-CTC loss avoiding the usage of character-level annotations. With a PG-CTC decoder, we gather high-level character classification vectors from two-dimensional space and decode them into text symbols without NMS and RoI operations involved, which guarantees high efficiency. Additionally, reasoning the relations between each character and its neighbors, a graph refinement module (GRM) is proposed to optimize the coarse recognition and improve the end-to-end performance.",
  "title": "PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network",
  "collection": "Scene Text Models",
  "area": "Computer Vision"
}
{
  "name": "Elastic Dense Block",
  "full_name": "Elastic Dense Block",
  "description": "**Elastic Dense Block** is a skip connection block that modifies the [Dense Block](https://paperswithcode.com/method/dense-block) with downsamplings and upsamplings in parallel branches at each layer to let the network learn from a data scaling policy in which inputs are processed at different resolutions in each layer. It is called \"elastic\" because each layer in the network is flexible in terms of choosing the best scale by a soft policy.",
  "title": "ELASTIC: Improving CNNs with Dynamic Scaling Policies",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "RotNet",
  "full_name": "RotNet",
  "description": "**RotNet** is a self-supervision approach that relies on predicting image rotations as the pretext task\r\nin order to learn image representations.",
  "title": "RotNet: Fast and Scalable Estimation of Stellar Rotation Periods Using Convolutional Neural Networks",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Class-MLP",
  "full_name": "Class-MLP",
  "description": "**Class-MLP** is an alternative to [average pooling](https://paperswithcode.com/method/average-pooling), which is an adaptation of the class-attention token introduced in [CaiT](https://paperswithcode.com/method/cait). In CaiT, this consists of two layers that have the same structure as the [transformer](https://paperswithcode.com/method/transformer), but in which only the class token is updated based on the frozen patch embeddings. In Class-MLP, the same approach is used, but after aggregating the patches with a [linear layer](https://paperswithcode.com/method/linear-layer), we replace the [attention-based interaction](https://paperswithcode.com/method/scaled) between the class and patch embeddings by simple linear layers, still keeping the patch embeddings frozen. This increases the performance, at the expense of adding some parameters and computational cost. This pooling variant is referred to as “class-MLP”, since the purpose of these few layers is to replace average pooling.",
  "title": "ResMLP: Feedforward networks for image classification with data-efficient training",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "CABiNet",
  "full_name": "Context Aggregated Bi-lateral Network for Semantic Segmentation",
  "description": "With the increasing demand of autonomous systems, pixelwise semantic segmentation for visual scene understanding needs to be not only accurate but also efficient for potential real-time applications. In this paper, we propose Context Aggregation Network, a dual branch convolutional neural network, with significantly lower computational costs as compared to the state-of-the-art, while maintaining a competitive prediction accuracy. Building upon the existing dual branch architectures for high-speed semantic segmentation, we design a high resolution branch for effective spatial detailing and a context branch with light-weight versions of global aggregation and local distribution blocks, potent to capture both long-range and local contextual dependencies required for accurate semantic segmentation, with low computational overheads. We evaluate our method on two semantic segmentation datasets, namely Cityscapes dataset and UAVid dataset. For Cityscapes test set, our model achieves state-of-the-art results with mIOU of 75.9%, at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. With regards to UAVid dataset, our proposed network achieves mIOU score of 63.5% with high execution speed (15 FPS).",
  "title": null,
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "MAD Learning",
  "full_name": "Memory-Associated Differential Learning",
  "description": "**Memory-Associated Differential** (**MAD**) Learning was developed to inference from the memorized facts that we already know to predict what we want to know.\r\n\r\nImage source: [Luo et al.](https://arxiv.org/pdf/2102.05246v1.pdf)",
  "title": "Memory-Associated Differential Learning",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "TD-VAE",
  "full_name": "TD-VAE",
  "description": "**TD-VAE**, or **Temporal Difference VAE**, is a generative sequence model that learns representations containing explicit beliefs about states several steps into the future, and that can be rolled out directly without single-step transitions. TD-VAE is trained on pairs of temporally separated time points, using an analogue of [temporal difference learning](https://paperswithcode.com/method/td-lambda) used in reinforcement learning.",
  "title": "Temporal Difference Variational Auto-Encoder",
  "collection": "Generative Sequence Models",
  "area": "Sequential"
}
{
  "name": "SortCut Sinkhorn Attention",
  "full_name": "SortCut Sinkhorn Attention",
  "description": "**SortCut Sinkhorn Attention** is a variant of [Sparse Sinkhorn Attention](https://paperswithcode.com/method/sparse-sinkhorn-attention) where a post-sorting truncation of the input sequence is performed, essentially performing a hard top-k operation on the input sequence blocks within the computational graph. While most attention models mainly re-weight or assign near-zero weights during training, this allows for explicitly and dynamically truncate the input sequence. Specifically:\r\n\r\n$$ Y = \\text{Softmax}\\left(Q{\\psi\\_{S}}\\left(K\\right)^{T}\\_{\\left[:n\\right]}\\right)\\psi\\_{S}\\left(V\\right)\\_{\\left[:n\\right]} $$\r\n\r\nwhere $n$ is the Sortfut budget hyperparameter.",
  "title": "Sparse Sinkhorn Attention",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "ELiSH",
  "full_name": "Exponential Linear Squashing Activation",
  "description": "The **Exponential Linear Squashing Activation Function**, or **ELiSH**, is an activation function used for neural networks. It shares common properties with [Swish](https://paperswithcode.com/method/swish), being made up of an [ELU](https://paperswithcode.com/method/elu) and a [Sigmoid](https://paperswithcode.com/method/sigmoid-activation):\r\n\r\n$$f\\left(x\\right) = \\frac{x}{1+e^{-x}} \\text{ if } x \\geq 0 $$\r\n$$f\\left(x\\right) = \\frac{e^{x} - 1}{1+e^{-x}} \\text{ if } x < 0 $$\r\n\r\nThe Sigmoid part of **ELiSH** improves information flow, while the linear parts solve issues of vanishing gradients.",
  "title": "The Quest for the Golden Activation Function",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Adabelief",
  "full_name": "Adabelief",
  "description": "",
  "title": "AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "SAINT",
  "full_name": "SAINT",
  "description": "**SAINT** is a hybrid deep learning approach to solving tabular data problems. SAINT performs attention over both rows and columns, and it includes an enhanced embedding method. The architecture, pre-training and training pipeline are as follows: \r\n\r\n- $L$ layers with 2 attention blocks each, one self-attention block, and a novel intersample attention blocks that computes attention across samples are used.\r\n- For pre-training, this involves minimizing the contrastive and denoising losses between a given data point and its views generated by [CutMix](https://paperswithcode.com/method/cutmix) and [mixup](https://paperswithcode.com/method/mixup). During finetuning/regular training, data passes through an embedding layer and then the SAINT model. Lastly, the contextual embeddings from SAINT are used to pass only the embedding corresponding to the CLS token through an [MLP](https://paperswithcode.com/method/feedforward-network) to obtain the final prediction.",
  "title": "SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "IB-BERT",
  "full_name": "Inverted Bottleneck BERT",
  "description": "**IB-BERT**, or **Inverted Bottleneck BERT**, is a [BERT](https://paperswithcode.com/method/bert) variant that uses an [inverted bottleneck](https://paperswithcode.com/method/inverted-residual-block) structure. It is used as a teacher network to train the [MobileBERT](https://paperswithcode.com/method/mobilebert) models.",
  "title": "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "HOPE",
  "full_name": "High-Order Proximity preserved Embedding",
  "description": "",
  "title": "Asymmetric Transitivity Preserving Graph Embedding",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "IPL",
  "full_name": "Iterative Pseudo-Labeling",
  "description": "**Iterative Pseudo-Labeling** (IPL) is a semi-supervised algorithm for speech recognition which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine tunes an existing model at each iteration using both labeled data and a subset of unlabeled data.",
  "title": "Iterative Pseudo-Labeling for Speech Recognition",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "Distributed Shampoo",
  "full_name": "Distributed Shampoo",
  "description": "A scalable second order optimization algorithm for deep learning.\r\n\r\nOptimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.",
  "title": "Towards Practical Second Order Optimization for Deep Learning",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Attention Sinks",
  "full_name": "Attention Sinks",
  "description": "Please enter a description about the method here",
  "title": "Efficient Streaming Language Models with Attention Sinks",
  "collection": "Attention",
  "area": "General"
}
{
  "name": "Disentangled Attribution Curves",
  "full_name": "Disentangled Attribution Curves",
  "description": "**Disentangled Attribution Curves (DAC)** provide interpretations of tree ensemble methods in the form of (multivariate) feature importance curves. For a given variable, or group of variables, [DAC](https://paperswithcode.com/method/dac) plots the importance of a variable(s) as their value changes.\r\n\r\nThe Figure to the right shows an example. The tree depicts a decision tree which performs binary classification using two features (representing the XOR function). In this problem, knowing the value of one of the features without knowledge of the other feature yields no information - the classifier still has a 50% chance of predicting either class. As a result, DAC produces curves which assign 0 importance to either feature on its own. Knowing both features yields perfect information about the classifier, and thus the DAC curve for both features together correctly shows that the interaction of the features produces the model’s predictions.",
  "title": "Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "SEER",
  "full_name": "SEER",
  "description": "**SEER** is a self-supervised learning approach for training large models on random, uncurated images with no supervision. It trains [RegNet-Y](https://paperswithcode.com/method/regnet-y) architectures with the [SwAV](https://paperswithcode.com/method/swav). Several adjustments are made to self-supervised training to make it work at a larger scale, including using a [cosine learning schedule](https://paperswithcode.com/method/cosine-annealing)",
  "title": "Self-supervised Pretraining of Visual Features in the Wild",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Varifocal Loss",
  "full_name": "Varifocal Loss",
  "description": "**Varifocal Loss** is a loss function for training a dense object detector to predict the IACS, inspired by [focal loss](https://paperswithcode.com/method/focal-loss). Unlike the focal loss that deals with positives and negatives equally, Varifocal Loss treats them asymmetrically.\r\n\r\n$$ VFL\\left(p, q\\right) = −q\\left(q\\log\\left(p\\right) + \\left(1 − q\\right)\\log\\left(1 − p\\right)\\right) \\text{ if } q > 0 $$\r\n\r\n$$ VFL\\left(p, q\\right) = −\\alpha{p^{\\gamma}}\\log\\left(1-p\\right) $$\r\n\r\nwhere $p$ is the predicted IACS and $q$ is the target IoU score.\r\n\r\nFor a positive training example, $q$ is set as the IoU between the generated bounding box and the ground-truth one\r\n(gt IoU), whereas for a negative training example, the training target $q$ for all classes is $0$.",
  "title": "VarifocalNet: An IoU-aware Dense Object Detector",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "PICARD",
  "full_name": "Parsing Incrementally for Constrained Auto-Regressive Decoding",
  "description": "",
  "title": "PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models",
  "collection": "Code Generation Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "lda2vec",
  "full_name": "lda2vec",
  "description": "**lda2vec** builds representations over both words and documents by mixing word2vec’s skipgram architecture with Dirichlet-optimized sparse topic mixtures. \r\n\r\nThe Skipgram Negative-Sampling (SGNS) objective of word2vec is modified to utilize document-wide feature vectors while simultaneously learning continuous document weights loading onto topic vectors. The total loss term $L$ is the sum of the Skipgram Negative Sampling Loss (SGNS) $L^{neg}\\_{ij}$ with the addition of a Dirichlet-likelihood term over document weights, $L\\_{d}$. The loss is conducted using a context vector, $\\overrightarrow{c\\_{j}}$ , pivot word vector $\\overrightarrow{w\\_{j}}$, target word vector $\\overrightarrow{w\\_{i}}$, and negatively-sampled word vector $\\overrightarrow{w\\_{l}}$:\r\n\r\n$$ L = L^{d} + \\Sigma\\_{ij}L^{neg}\\_{ij} $$\r\n\r\n$$L^{neg}\\_{ij} = \\log\\sigma\\left(c\\_{j}\\cdot\\overrightarrow{w\\_{i}}\\right) + \\sum^{n}\\_{l=0}\\sigma\\left(-\\overrightarrow{c\\_{j}}\\cdot\\overrightarrow{w\\_{l}}\\right)$$",
  "title": "Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "Apollo",
  "full_name": "Adaptive Parameter-wise Diagonal Quasi-Newton Method",
  "description": "Please enter a description about the method here",
  "title": "Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "ProxylessNet-Mobile",
  "full_name": "ProxylessNet-Mobile",
  "description": "**ProxylessNet-Mobile** is a convolutional neural architecture learnt with the [ProxylessNAS](https://paperswithcode.com/method/proxylessnas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) algorithm that is optimized for mobile devices. It uses inverted residual blocks (MBConvs) from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2) as its basic building block.",
  "title": "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "Adaptive Span Transformer",
  "full_name": "Adaptive Span Transformer",
  "description": "The **Adaptive Attention Span Transformer** is a Transformer that utilises an improvement to the self-attention layer called [adaptive masking](https://paperswithcode.com/method/adaptive-masking) that allows the model to choose its own context size. This results in a network where each attention layer gathers information on their own context. This allows for scaling to input sequences of more than 8k tokens.\r\n\r\nTheir proposals are based on the observation that, with the dense attention of a traditional [Transformer](https://paperswithcode.com/method/transformer), each attention head shares the same attention span $S$ (attending over the full context). But many attention heads can specialize to more local context (others look at the longer sequence). This motivates the need for a variant of self-attention that allows the model to choose its own context size (adaptive masking - see components).",
  "title": "Adaptive Attention Span in Transformers",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "MACEst",
  "full_name": "MACEst",
  "description": "**Model Agnostic Confidence Estimator**, or **MACEst**, is a model-agnostic confidence estimator. Using a set of nearest neighbours, the algorithm differs from other methods by estimating confidence independently as a local quantity which explicitly accounts for both aleatoric and epistemic uncertainty. This approach differs from standard calibration methods that use a global point prediction model as a starting point for the confidence estimate.",
  "title": "MACEst: The reliable and trustworthy Model Agnostic Confidence Estimator",
  "collection": "Confidence Estimators",
  "area": "General"
}
{
  "name": "EfficientNet",
  "full_name": "EfficientNet",
  "description": "**EfficientNet** is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a *compound coefficient*. Unlike conventional practice that arbitrary scales  these factors, the EfficientNet scaling method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\\alpha ^ N$,  width by $\\beta ^ N$, and image size by $\\gamma ^ N$, where $\\alpha, \\beta, \\gamma$ are constant coefficients determined by a small grid search on the original small model. EfficientNet uses a compound coefficient $\\phi$ to uniformly scales network width, depth, and resolution in a  principled way.\r\n\r\nThe compound scaling method is justified by the intuition that if the input image is bigger, then the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns on the bigger image.\r\n\r\nThe base EfficientNet-B0 network is based on the inverted bottleneck residual blocks of [MobileNetV2](https://paperswithcode.com/method/mobilenetv2), in addition to squeeze-and-excitation blocks.\r\n\r\n EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.",
  "title": "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "End-To-End Memory Network",
  "full_name": "End-To-End Memory Network",
  "description": "An **End-to-End Memory Network** is a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of [Memory Network](https://paperswithcode.com/method/memory-network), but unlike the model in that work, it is trained end-to-end, and hence requires significantly less supervision during training. It can also be seen as an extension of RNNsearch to the case where multiple computational steps (hops) are performed per output symbol.\r\n\r\nThe model takes a discrete set of inputs $x\\_{1}, \\dots, x\\_{n}$ that are to be stored in the memory, a query $q$, and outputs an answer $a$. Each of the $x\\_{i}$, $q$, and $a$ contains symbols coming from a dictionary with $V$ words. The model writes all $x$ to the memory up to a fixed buffer size, and then finds a continuous representation for the $x$ and $q$. The continuous representation is then processed via multiple hops to output $a$.",
  "title": "End-To-End Memory Networks",
  "collection": "Working Memory Models",
  "area": "General"
}
{
  "name": "VFNet",
  "full_name": "VarifocalNet",
  "description": "**VarifocalNet** is a method aimed at accurately ranking a huge number of candidate detections in object detection. It consists of a new loss function, named [Varifocal Loss](https://paperswithcode.com/method/varifocal-loss), for training a dense object detector to predict the IACS, and a new efficient star-shaped bounding box feature representation for estimating the IACS and refining coarse bounding boxes. Combining these two new components and a bounding box refinement branch, results in a dense object detector on the [FCOS](https://paperswithcode.com/method/fcos) architecture, what the authors call VarifocalNet or VFNet for short.",
  "title": "VarifocalNet: An IoU-aware Dense Object Detector",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Global Convolutional Network",
  "full_name": "Global Convolutional Network",
  "description": "A **Global Convolutional Network**, or **GCN**, is a semantic segmentation building block that utilizes a large kernel to help perform classification and localization tasks simultaneously. It can be used in a [FCN](https://paperswithcode.com/method/fcn)-like structure, where the [GCN](https://paperswithcode.com/method/gcn) is used to generate semantic score maps. Instead of directly using larger kernels or global [convolution](https://paperswithcode.com/method/convolution), the GCN module employs a combination of $1 \\times k + k \\times 1$ and $k \\times 1 + 1 \\times k$ convolutions, which enables [dense connections](https://paperswithcode.com/method/dense-connections) within a large\r\n$k\\times{k}$ region in the feature map",
  "title": "Large Kernel Matters -- Improve Semantic Segmentation by Global Convolutional Network",
  "collection": "Semantic Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "EMQAP",
  "full_name": "EMQAP",
  "description": "**EMQAP**, or **E-Manual Question Answering Pipeline**, is an approach for answering questions pertaining to electronics devices. Built upon the pretrained [RoBERTa](https://paperswithcode.com/method/roberta), it harbors a supervised multi-task learning framework which efficiently performs the dual tasks of identifying the section in the E-manual where the answer can be found and the exact answer span within that section.",
  "title": "Question Answering over Electronic Devices: A New Benchmark Dataset and a Multi-Task Learning based QA Framework",
  "collection": "Question Answering Models",
  "area": "Natural Language Processing"
}
{
  "name": "Poly",
  "full_name": "Polynomial",
  "description": "",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Metrix",
  "full_name": "Metric mixup",
  "description": "A generic way of representing and interpolating labels, which allows straightforward extension of any kind of [mixup](https://paperswithcode.com/method/mixup) to deep metric learning for a large class of loss functions.",
  "title": "It Takes Two to Tango: Mixup for Deep Metric Learning",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Anycost GAN",
  "full_name": "Anycost GAN",
  "description": "**Anycost GAN** is a type of generative adversarial network for image synthesis and editing. Given an input image, we project it into the latent space with encoder $E$ and backward optimization. We can modify the latent code with user input to edit the image. During editing, a sub-generator of small cost is used for fast and interactive preview; during idle time, the full cost generator renders the final, high-quality output. The outputs from the full and sub-generators are visually consistent during projection and editing.",
  "title": "Anycost GANs for Interactive Image Synthesis and Editing",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "DeepMask",
  "full_name": "DeepMask",
  "description": "**DeepMask** is an object proposal algorithm based on a convolutional neural network. Given an input image patch, DeepMask generates a class-agnostic mask and an associated score which estimates the likelihood of the patch fully containing a centered object (without any notion of an object category). The core of the model is a ConvNet which jointly predicts the mask and the object score. A large part of the network is shared between those two tasks: only the last few network\r\nlayers are specialized for separately outputting a mask and score prediction.",
  "title": "Learning to Segment Object Candidates via Recursive Neural Networks",
  "collection": "Region Proposal",
  "area": "Computer Vision"
}
{
  "name": "Focal Transformers",
  "full_name": "Focal Transformers",
  "description": "The **focal self-attention** is built to make Transformer layers scalable to high-resolution inputs.  Instead of attending all tokens at fine-grain, the approach attends the fine-grain tokens only locally, but the summarized ones globally. As such, it can cover as many regions as standard self-attention but with much less cost. An image is first partitioned into patches, resulting in visual tokens. Then a patch embedding layer, consisting of a convolutional layer with filter and stride of same size, to project the patches into hidden features. This spatial feature map in then passed to four stages of focal Transformer blocks. Each focal Transformer block consists of $N_i$ focal Transformer layers. Patch embedding layers are used in between to reduce spatial size of feature map by factor 2, while feature dimension increased by 2.",
  "title": "Focal Self-attention for Local-Global Interactions in Vision Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Sparse Convolutions",
  "full_name": "Sparse Convolutions",
  "description": "",
  "title": "Spatially-sparse convolutional neural networks",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "MoCo",
  "full_name": "Momentum Contrast",
  "description": "**MoCo**, or **Momentum Contrast**, is a self-supervised learning algorithm with a contrastive loss. \r\n\r\nContrastive loss methods can be thought of as building dynamic dictionaries. The \"keys\" (tokens) in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network. Unsupervised learning trains encoders to perform dictionary look-up: an encoded “query” should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss. \r\n\r\nMoCo can be viewed as a way to build large and consistent dictionaries for unsupervised learning with a contrastive loss. In MoCo, we maintain the dictionary as a queue of data samples: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued. The queue decouples the dictionary size from the mini-batch size, allowing it to be large. Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.",
  "title": "Momentum Contrast for Unsupervised Visual Representation Learning",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "Teacher-Tutor-Student Knowledge Distillation",
  "full_name": "Teacher-Tutor-Student Knowledge Distillation",
  "description": "**Teacher-Tutor-Student Knowledge Distillation** is a method for image virtual try-on models. It treats fake images produced by the parser-based method as \"tutor knowledge\", where the artifacts can be corrected by real \"teacher knowledge\", which is extracted from the real person images in a self-supervised way. Other than using real images as supervisions, knowledge distillation is formulated in the try-on problem as distilling the appearance flows between the person image and the garment image, enabling the finding of dense correspondences between them to produce high-quality results.",
  "title": "Parser-Free Virtual Try-on via Distilling Appearance Flows",
  "collection": "Knowledge Distillation",
  "area": "General"
}
{
  "name": "Additive Attention",
  "full_name": "Additive Attention",
  "description": "**Additive Attention**, also known as **Bahdanau Attention**, uses a one-hidden layer feed-forward network to calculate the attention alignment score:\r\n\r\n$$f_{att}\\left(\\textbf{h}_{i}, \\textbf{s}\\_{j}\\right) = v\\_{a}^{T}\\tanh\\left(\\textbf{W}\\_{a}\\left[\\textbf{h}\\_{i};\\textbf{s}\\_{j}\\right]\\right)$$\r\n\r\nwhere $\\textbf{v}\\_{a}$ and $\\textbf{W}\\_{a}$ are learned attention parameters. Here $\\textbf{h}$ refers to the hidden states for the encoder, and $\\textbf{s}$ is the hidden states for the decoder. The function above is thus a type of alignment score function. We can use a matrix of alignment scores to show the correlation between source and target words, as the Figure to the right shows.\r\n\r\nWithin a neural network, once we have the alignment scores, we calculate the final scores using a [softmax](https://paperswithcode.com/method/softmax) function of these alignment scores (ensuring it sums to 1).",
  "title": "Neural Machine Translation by Jointly Learning to Align and Translate",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "AugMix",
  "full_name": "AugMix",
  "description": "AugMix mixes augmented images through linear interpolations. Consequently it is like [Mixup](https://paperswithcode.com/method/mixup) but instead mixes augmented versions of the same image.",
  "title": "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Activation Regularization",
  "full_name": "Activation Regularization",
  "description": "**Activation Regularization (AR)**, or $L\\_{2}$ activation regularization, is regularization performed on activations as opposed to weights. It is usually used in conjunction with [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks). It is defined as:\r\n\r\n$$\\alpha{L}\\_{2}\\left(m\\circ{h\\_{t}}\\right) $$\r\n\r\nwhere $m$ is a [dropout](https://paperswithcode.com/method/dropout) mask used by later parts of the model, $L\\_{2}$ is the $L\\_{2}$ norm, and $h_{t}$ is the output of an RNN at timestep $t$, and $\\alpha$ is a scaling coefficient. \r\n\r\nWhen applied to the output of a dense layer, AR penalizes activations that are substantially away from 0, encouraging activations to remain small.",
  "title": "Revisiting Activation Regularization for Language RNNs",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "SepFormer",
  "full_name": "SepFormer",
  "description": "**SepFormer** is [Transformer](https://paperswithcode.com/methods/category/transformers)-based neural network for speech separation. The SepFormer learns short and long-term dependencies with a multi-scale approach that employs transformers. It is mainly composed of multi-head attention and feed-forward layers. A dual-path framework (introduced by DPRNN) is adopted and [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks) are replaced with a multiscale pipeline composed of transformers that learn both short and long-term dependencies. The dual-path framework enables the mitigation of the quadratic complexity of transformers, as transformers in the dual-path framework process smaller chunks.\r\n\r\nThe model is based on the learned-domain masking approach and employs an encoder, a decoder, and a masking network, as shown in the figure. The encoder is fully convolutional, while the decoder employs two Transformers embedded inside the dual-path processing block. The decoder finally reconstructs the separated signals in the time domain by using the masks predicted by the masking network.",
  "title": "Attention is All You Need in Speech Separation",
  "collection": "Speech Separation Models",
  "area": "Audio"
}
{
  "name": "ClipBERT",
  "full_name": "ClipBERT",
  "description": "**ClipBERT** is a framework for end-to-end-learning for video-and-language tasks, by employing sparse sampling, where only a single or a few sparsely sampled short clips from a video are used at each training step. Two aspects distinguish ClipBERT from previous work. \r\n\r\nFirst, in contrast to densely extracting video features (adopted by most existing methods), CLIPBERT sparsely samples only one single or a few short clips from the full-length videos at each training step. The hypothesis is that visual features from sparse clips already capture key visual and semantic information in the video, as consecutive clips usually contain similar semantics from a continuous scene. Thus, a handful of clips are sufficient for training, instead of using the full video. Then, predictions from multiple densely-sampled clips are aggregated to obtain the final video-level prediction during inference, which is less computational demanding. \r\n\r\nThe second differentiating aspect concerns the initialization of model weights (i.e., transfer through pre-training). The authors use 2D architectures (e.g., [ResNet](https://paperswithcode.com/method/resnet)-50) instead of 3D features as the visual backbone for video encoding, allowing them to harness the power of image-text pretraining for video-text understanding along with the advantages of low memory cost and runtime efficiency.",
  "title": "Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling",
  "collection": "Generative Video Models",
  "area": "Computer Vision"
}
{
  "name": "Cosine Power Annealing",
  "full_name": "Cosine Power Annealing",
  "description": "Interpolation between [exponential decay](https://paperswithcode.com/method/exponential-decay) and [cosine annealing](https://paperswithcode.com/method/cosine-annealing).",
  "title": "sharpDARTS: Faster and More Accurate Differentiable Architecture Search",
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "CGNN",
  "full_name": "Crystal Graph Neural Network",
  "description": "The full architecture of CGNN is presented at [CGNN's official site](https://tony-y.github.io/cgnn/architectures/).",
  "title": "Crystal Graph Neural Networks for Data Mining in Materials Science",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "Laplacian PE",
  "full_name": "Laplacian Positional Encodings",
  "description": "[Laplacian eigenvectors](https://paperswithcode.com/paper/laplacian-eigenmaps-and-spectral-techniques) represent a natural generalization of the [Transformer](https://paperswithcode.com/method/transformer) positional encodings (PE) for graphs as the eigenvectors of a discrete line (NLP graph) are the cosine and sinusoidal functions. They help encode distance-aware information (i.e., nearby nodes have similar positional features and farther nodes have dissimilar positional features).\r\n\r\nHence, Laplacian Positional Encoding (PE) is a general method to encode node positions in a graph. For each node, its Laplacian PE is the k smallest non-trivial eigenvectors.",
  "title": "Benchmarking Graph Neural Networks",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "DD-PPO",
  "full_name": "Decentralized Distributed Proximal Policy Optimization",
  "description": "**Decentralized Distributed Proximal Policy Optimization (DD-PPO)** is a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever `stale'), making it conceptually simple and easy to implement. \r\n\r\nProximal Policy Optimization, or [PPO](https://paperswithcode.com/method/ppo), is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of [TRPO](https://paperswithcode.com/method/trpo), while using only first-order optimization. \r\n\r\nLet $r\\_{t}\\left(\\theta\\right)$ denote the probability ratio $r\\_{t}\\left(\\theta\\right) = \\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}$, so $r\\left(\\theta\\_{old}\\right) = 1$. TRPO maximizes a “surrogate” objective:\r\n\r\n$$ L^{v}\\left({\\theta}\\right) = \\hat{\\mathbb{E}}\\_{t}\\left[\\frac{\\pi\\_{\\theta}\\left(a\\_{t}\\mid{s\\_{t}}\\right)}{\\pi\\_{\\theta\\_{old}}\\left(a\\_{t}\\mid{s\\_{t}}\\right)})\\hat{A}\\_{t}\\right] = \\hat{\\mathbb{E}}\\_{t}\\left[r\\_{t}\\left(\\theta\\right)\\hat{A}\\_{t}\\right] $$\r\n\r\nAs a general abstraction, DD-PPO implements the following:\r\nat step $k$, worker $n$ has a copy of the parameters, $\\theta^k_n$, calculates the gradient, $\\delta \\theta^k_n$, and updates $\\theta$ via \r\n\r\n$$ \\theta^{k+1}\\_n =  \\text{ParamUpdate}\\Big(\\theta^{k}\\_n, \\text{AllReduce}\\big(\\delta \\theta^k\\_1, \\ldots, \\delta \\theta^k\\_N\\big)\\Big) = \\text{ParamUpdate}\\Big(\\theta^{k}\\_n, \\frac{1}{N}  \\sum_{i=1}^{N} { \\delta \\theta^k_i}   \\Big) $$\r\n\r\nwhere $\\text{ParamUpdate}$ is any first-order optimization technique (e.g. gradient descent) and $\\text{AllReduce}$ performs a reduction (e.g. 
mean) over all copies of a variable and returns the result to all workers.\r\nDistributed DataParallel scales very well (near-linear scaling up to 32,000 GPUs), and is reasonably simple to implement (all workers synchronously running identical code).",
  "title": "DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames",
  "collection": "Distributed Reinforcement Learning",
  "area": "Reinforcement Learning"
}
{
  "name": "NetAdapt",
  "full_name": "NetAdapt",
  "description": "**NetAdapt** is a network shrinking algorithm to adapt a pretrained network to a mobile platform given a real resource budget. NetAdapt can incorporate direct metrics, such as latency and energy, into the optimization to maximize the adaptation performance based on the characteristics of the platform. By using empirical measurements, NetAdapt can be applied to any platform as long as we can measure the desired metrics, without any knowledge of the underlying implementation of the platform. \r\n\r\nWhile many existing algorithms simplify networks based on the number of MACs or weights, optimizing those indirect metrics may not necessarily reduce the direct metrics, such as latency and energy consumption. To solve this problem, NetAdapt incorporates direct metrics into its adaptation algorithm. These direct metrics are evaluated using *empirical measurements*, so that detailed knowledge of the platform and toolchain is not required. NetAdapt automatically and progressively simplifies a pre-trained network until the resource budget is met while maximizing the accuracy.",
  "title": "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications",
  "collection": "Network Shrinking",
  "area": "General"
}
{
  "name": "E-swish",
  "full_name": "E-swish",
  "description": "",
  "title": "E-swish: Adjusting Activations to Different Network Depths",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "HyperSA",
  "full_name": "HyperGraph Self-Attention",
  "description": "An extension of Self-Attention to hypergraph\r\nSkeleton-based action recognition aims to recognize human actions given human joint coordinates with skeletal interconnections. By defining a graph with joints as vertices and their natural connections as edges, previous works successfully adopted Graph Convolutional networks (GCNs) to model joint co-occurrences and achieved superior performance. More recently, a limitation of GCNs is identified, i.e., the topology is fixed after training. To relax such a restriction, Self-Attention (SA) mechanism has been adopted to make the topology of GCNs adaptive to the input, resulting in the state-of-the-art hybrid models. Concurrently, attempts with plain Transformers have also been made, but they still lag behind state-of-the-art GCN-based methods due to the lack of structural prior. Unlike hybrid models, we propose a more elegant solution to incorporate the bone connectivity into Transformer via a graph distance embedding. Our embedding retains the information of skeletal structure during training, whereas GCNs merely use it for initialization. More importantly, we reveal an underlying issue of graph models in general, i.e., pairwise aggregation essentially ignores the high-order kinematic dependencies between body joints. To fill this gap, we propose a new self-attention (SA) mechanism on hypergraph, termed Hypergraph Self-Attention (HyperSA), to incorporate intrinsic higher-order relations into the model. We name the resulting model Hyperformer, and it beats state-of-the-art graph models w.r.t. accuracy and efficiency on NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets.",
  "title": "Hypergraph Transformer for Skeleton-based Action Recognition",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Coresets",
  "full_name": "Coresets",
  "description": "",
  "title": "Active Learning for Convolutional Neural Networks: A Core-Set Approach",
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "UORO",
  "full_name": "Unbiased Online Recurrent Optimization",
  "description": "",
  "title": "Unbiased Online Recurrent Optimization",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "Channel Attention Module",
  "full_name": "Channel Attention Module",
  "description": "A **Channel Attention Module** is a module for channel-based attention in convolutional neural networks. We produce a channel attention map by exploiting the inter-channel relationship of features. As each channel of a feature map is considered as a feature detector, channel attention focuses on ‘what’ is meaningful given an input image. To compute the channel attention efficiently, we squeeze the spatial dimension of the input feature map. \r\n\r\nWe first aggregate spatial information of a feature map by using both average-pooling and max-pooling operations, generating two different spatial context descriptors: $\\mathbf{F}^{c}\\_{avg}$ and $\\mathbf{F}^{c}\\_{max}$, which denote average-pooled features and max-pooled features respectively. \r\n\r\nBoth descriptors are then forwarded to a shared network to produce our channel attention map $\\mathbf{M}\\_{c} \\in \\mathbb{R}^{C\\times{1}\\times{1}}$. Here $C$ is the number of channels. The shared network is composed of multi-layer perceptron (MLP) with one hidden layer. To reduce parameter overhead, the hidden activation size is set to $\\mathbb{R}^{C/r×1×1}$, where $r$ is the reduction ratio. After the shared network is applied to each descriptor, we merge the output feature vectors using element-wise summation. In short, the channel attention is computed as:\r\n\r\n$$  \\mathbf{M\\_{c}}\\left(\\mathbf{F}\\right) = \\sigma\\left(\\text{MLP}\\left(\\text{AvgPool}\\left(\\mathbf{F}\\right)\\right)+\\text{MLP}\\left(\\text{MaxPool}\\left(\\mathbf{F}\\right)\\right)\\right) $$\r\n\r\n$$  \\mathbf{M\\_{c}}\\left(\\mathbf{F}\\right) = \\sigma\\left(\\mathbf{W\\_{1}}\\left(\\mathbf{W\\_{0}}\\left(\\mathbf{F}^{c}\\_{avg}\\right)\\right) +\\mathbf{W\\_{1}}\\left(\\mathbf{W\\_{0}}\\left(\\mathbf{F}^{c}\\_{max}\\right)\\right)\\right) $$\r\n\r\nwhere $\\sigma$ denotes the sigmoid function, $\\mathbf{W}\\_{0} \\in \\mathbb{R}^{C/r\\times{C}}$, and $\\mathbf{W}\\_{1} \\in \\mathbb{R}^{C\\times{C/r}}$. 
Note that the MLP weights, $\\mathbf{W}\\_{0}$ and $\\mathbf{W}\\_{1}$, are shared for both inputs and the [ReLU](https://paperswithcode.com/method/relu) activation function is followed by $\\mathbf{W}\\_{0}$.\r\n\r\nNote that the channel attention module with just [average pooling](https://paperswithcode.com/method/average-pooling) is the same as the [Squeeze-and-Excitation Module](https://paperswithcode.com/method/squeeze-and-excitation-block).",
  "title": "CBAM: Convolutional Block Attention Module",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "SRU",
  "full_name": "SRU",
  "description": "**SRU**, or **Simple Recurrent Unit**, is a recurrent neural unit with a light form of recurrence. SRU exhibits the same level of parallelism as [convolution](https://paperswithcode.com/method/convolution) and [feed-forward nets](https://paperswithcode.com/methods/category/feedforward-networks). This is achieved by balancing sequential dependence and independence: while the state computation of SRU is time-dependent, each state dimension is independent. This simplification enables CUDA-level optimizations that parallelize the computation across hidden dimensions and time steps, effectively using the full capacity of modern GPUs. \r\n\r\nSRU also replaces the use of convolutions (i.e., ngram filters), as in [QRNN](https://paperswithcode.com/method/qrnn) and KNN, with more recurrent connections. This retains modeling capacity, while using less computation (and hyper-parameters). Additionally, SRU improves the training of deep recurrent models by employing [highway connections](https://paperswithcode.com/method/highway-layer) and a parameter initialization scheme tailored for gradient propagation in deep architectures.\r\n\r\nA single layer of SRU involves the following computation:\r\n\r\n$$\r\n\\mathbf{f}\\_{t} =\\sigma\\left(\\mathbf{W}\\_{f} \\mathbf{x}\\_{t}+\\mathbf{v}\\_{f} \\odot \\mathbf{c}\\_{t-1}+\\mathbf{b}\\_{f}\\right) \r\n$$\r\n\r\n$$\r\n\\mathbf{c}\\_{t} =\\mathbf{f}\\_{t} \\odot \\mathbf{c}\\_{t-1}+\\left(1-\\mathbf{f}\\_{t}\\right) \\odot\\left(\\mathbf{W} \\mathbf{x}\\_{t}\\right) \\\\\r\n$$\r\n\r\n$$\r\n\\mathbf{r}\\_{t} =\\sigma\\left(\\mathbf{W}\\_{r} \\mathbf{x}\\_{t}+\\mathbf{v}\\_{r} \\odot \\mathbf{c}\\_{t-1}+\\mathbf{b}\\_{r}\\right) \\\\\r\n$$\r\n\r\n$$\r\n\\mathbf{h}\\_{t} =\\mathbf{r}\\_{t} \\odot \\mathbf{c}\\_{t}+\\left(1-\\mathbf{r}\\_{t}\\right) \\odot \\mathbf{x}\\_{t}\r\n$$\r\n\r\nwhere $\\mathbf{W}, \\mathbf{W}\\_{f}$ and $\\mathbf{W}\\_{r}$ are parameter matrices and $\\mathbf{v}\\_{f}, \\mathbf{v}\\_{r}, 
\\mathbf{b}\\_{f}$ and $\\mathbf{b}_{v}$ are parameter vectors to be learnt during training. The complete architecture decomposes to two sub-components: a light recurrence and a highway network,\r\n\r\nThe light recurrence component successively reads the input vectors $\\mathbf{x}\\_{t}$ and computes the sequence of states $\\mathbf{c}\\_{t}$ capturing sequential information. The computation resembles other recurrent networks such as [LSTM](https://paperswithcode.com/method/lstm), [GRU](https://paperswithcode.com/method/gru) and RAN. Specifically, a forget gate $\\mathbf{f}\\_{t}$ controls the information flow and the state vector $\\mathbf{c}\\_{t}$ is determined by adaptively averaging the previous state $\\mathbf{c}\\_{t-1}$ and the current observation $\\mathbf{W} \\mathbf{x}_{+}$according to $\\mathbf{f}\\_{t}$.",
  "title": "Simple Recurrent Units for Highly Parallelizable Recurrence",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "KPE",
  "full_name": "Keypoint Pose Encoding",
  "description": "",
  "title": "KPE: Keypoint Pose Encoding for Transformer-based Image Generation",
  "collection": "Pose Estimation Blocks",
  "area": "Computer Vision"
}
{
  "name": "Blue River Controls",
  "full_name": "Blue River Controls",
  "description": "**Blue River Controls** is a tool that allows users to train and test reinforcement learning algorithms on real-world hardware. It features a simple interface based on OpenAI Gym, that works directly on both simulation and hardware.",
  "title": "Blue River Controls: A toolkit for Reinforcement Learning Control Systems on Hardware",
  "collection": "Reinforcement Learning Frameworks",
  "area": "Reinforcement Learning"
}
{
  "name": "TILDEv2",
  "full_name": "TILDEv2",
  "description": "**TILDEv2** is a [BERT](https://paperswithcode.com/method/bert)-based re-ranking method that stems from [TILDE](https://dl.acm.org/doi/abs/10.1145/3404835.3462922) but that addresses its limitations. It relies on contextualized exact term matching with expanded passages. This requires to only store in the index the score of tokens that appear in the expanded passages (rather than all the vocabulary), thus producing indexes that are 99% smaller than those of the original.\r\n\r\nSpecifically, TILDE is modified in the following aspects:\r\n\r\n- **Exact Term Matching**. The query likelihood matching originally employed in TILDE, expands passages into the BERT vocabulary size, resulting in large indexes. To overcome this issue, estimating relevance scores is achieved with contextualized exact term matching. This allows the model to index tokens only present in the passage, thus reducing the index size. In addition to this, we replace the query likelihood loss function, with the Noise contrastive estimation (NCE) loss that allows to better leverage negative training samples. \r\n \r\n- **Passage Expansion**. To overcome the vocabulary mismatch problem that affects exact term matching methods, passage expansion is used to expand the original passage collection. Passages in the collection are expanded using deep LMs with a limited number of tokens. This requires TILDEv2 to only index a few extra tokens in addition to those in the original passages.",
  "title": "Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion",
  "collection": "Passage Re-Ranking Models",
  "area": "Natural Language Processing"
}
{
  "name": "4D A*",
  "full_name": "Four-dimensional A-star",
  "description": "The aim of 4D A* is to find the shortest path between two four-dimensional (4D) nodes of a 4D search space - a starting node and a target node - as long as there is a path. It achieves both optimality and completeness. The former is because the path is shortest possible, and the latter because if the solution exists the algorithm is guaranteed to find it.",
  "title": "Artificial Intelligence Control in 4D Cylindrical Space for Industrial Robotic Applications",
  "collection": "Heuristic Search Algorithms",
  "area": "Reinforcement Learning"
}
{
  "name": "MXMNet",
  "full_name": "Multiplex Molecular Graph Neural Network",
  "description": "The **Multiplex Molecular Graph Neural Network (MXMNet)** is an approach for the representation learning of molecules. The molecular interactions are divided into two categories: local and global. Then a two-layer multiplex graph $G = \\\\{ G_{l}, G_{g} \\\\}$ is constructed for a molecule. In $G$, the local layer $G_{l}$ only contains the local connections that mainly capture covalent interactions, and the global layer $G_{g}$ contains the global connections that cover non-covalent interactions. MXMNet uses the Multiplex Molecular (MXM) module that contains a novel angle-aware message passing operated on $G_{l}$ and an efficient message passing operated on $G_{g}$.",
  "title": "Molecular Mechanics-Driven Graph Neural Network with Multiplex Graph for Molecular Structures",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "SEAM",
  "full_name": "Self-supervised Equivariant Attention Mechanism",
  "description": "**Self-supervised Equivariant Attention Mechanism**, or **SEAM**, is an attention mechanism for weakly supervised semantic segmentation. The SEAM applies consistency regularization on CAMs from various transformed images to provide self-supervision for network learning. To further improve the network prediction consistency, SEAM introduces the pixel correlation module (PCM), which captures context appearance information for each pixel and revises original CAMs by learned affinity attention maps. The SEAM is implemented by a [siamese network](https://paperswithcode.com/method/siamese-network) with equivariant cross regularization (ECR) loss, which regularizes the original CAMs and the revised CAMs on different branches.",
  "title": "Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "cVAE",
  "full_name": "Conditional Variational Auto Encoder",
  "description": "",
  "title": "Learning Structured Output Representation using Deep Conditional Generative Models",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "LayoutLMv2",
  "full_name": "LayoutLMv2",
  "description": "**LayoutLMv2** is an architecture and pre-training method for document understanding. The model is pre-trained with a great number of unlabeled scanned document images from the IIT-CDIP dataset, where some images in the text-image pairs are randomly replaced with another document image to make the model learn whether the image and OCR texts are correlated or not. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks.\r\n\r\nSpecifically, an enhanced Transformer architecture is used, i.e. a multi-modal Transformer asisthe backbone of LayoutLMv2. The multi-modal Transformer accepts inputs of three modalities: text, image, and layout. The input of each modality is converted to an embedding sequence and fused by the encoder. The model establishes deep interactions within and between modalities by leveraging the powerful Transformer layers.",
  "title": "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding",
  "collection": "Document Understanding Models",
  "area": "Natural Language Processing"
}
{
  "name": "Colorization Transformer",
  "full_name": "Colorization Transformer",
  "description": "**Colorization Transformer** is a probabilistic [colorization](https://paperswithcode.com/method/colorization) model composed only of [axial self-attention blocks](https://paperswithcode.com/method/axial). The main advantages of these blocks are the ability to capture a global receptive field with only two layers and $\\mathcal{O}(D\\sqrt{D})$ instead of $\\text{O}(D^{2})$ complexity. In order to enable colorization of high-resolution grayscale images, the task is decomposed into three simpler sequential subtasks: coarse low resolution autoregressive colorization, parallel color and spatial super-resolution.\r\n\r\nFor coarse low resolution colorization, a conditional variant of [Axial Transformer](https://paperswithcode.com/method/axial) is applied. The authors leverage the semi-parallel sampling mechanism of Axial Transformers. Finally, fast parallel deterministic upsampling models are employed to super-resolve the coarsely colorized image into the final high resolution output.",
  "title": "Colorization Transformer",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Synthesizer",
  "full_name": "Synthesizer",
  "description": "The  **Synthesizer** is a model that learns synthetic attention weights without token-token interactions. Unlike [Transformers](https://paperswithcode.com/method/transformer), the model eschews dot product self-attention but also content-based self-attention altogether. Synthesizer learns to synthesize the self-alignment matrix instead of manually computing pairwise dot products. It is transformation-based, only relies on simple feed-forward layers, and completely dispenses with dot products and explicit token-token interactions. \r\n\r\nThis new module employed by the Synthesizer is called \"Synthetic Attention\": a new way of learning to attend without explicitly attending (i.e., without dot product attention or [content-based attention](https://paperswithcode.com/method/content-based-attention)). Instead, Synthesizer generate the alignment matrix independent of token-token dependencies.",
  "title": "Synthesizer: Rethinking Self-Attention in Transformer Models",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "PFPNet",
  "full_name": "Parallel Feature Pyramid Network",
  "description": "",
  "title": "Parallel Feature Pyramid Network for Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Cascade Mask R-CNN",
  "full_name": "Cascade Mask R-CNN",
  "description": "**Cascade Mask R-CNN** extends [Cascade R-CNN](https://paperswithcode.com/method/cascade-r-cnn) to instance segmentation, by adding a\r\nmask head to the cascade.\r\n\r\nIn the [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn), the segmentation branch is inserted in parallel to the detection branch. However, the Cascade [R-CNN](https://paperswithcode.com/method/r-cnn) has multiple detection branches. This raises the questions of 1) where to add the segmentation branch and 2) how many segmentation branches to add. The authors consider three strategies for mask prediction in the Cascade R-CNN. The first two strategies address the first question, adding a single mask prediction head at either the first or last stage of the Cascade R-CNN. Since the instances used to train the segmentation branch are the positives of the detection branch, their number varies in these two strategies. Placing the segmentation head later on the cascade leads to more examples. However, because segmentation is a pixel-wise operation, a large number of highly overlapping instances is not necessarily as helpful as for object detection, which is a patch-based operation. The third strategy addresses the second question, adding a segmentation branch to each\r\ncascade stage. This maximizes the diversity of samples used to learn the mask prediction task. \r\n\r\nAt inference time, all three strategies predict the segmentation masks on the patches produced by the final object detection stage, irrespective of the cascade stage on which the segmentation mask is implemented and how many segmentation branches there are.",
  "title": "Cascade R-CNN: Delving into High Quality Object Detection",
  "collection": "Instance Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "VL-T5",
  "full_name": "VL-T5",
  "description": "VL-T5 is a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation. The model learns to generate labels in text based on the visual and textual inputs. In contrast to other existing methods, the framework unifies tasks as generating text labels conditioned on multimodal inputs. This allows the model to tackle vision-and-language tasks with unified text generation objective. The models use text prefixes to adapt to different tasks.",
  "title": "Unifying Vision-and-Language Tasks via Text Generation",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "AdaRNN",
  "full_name": "AdaRNN",
  "description": "**AdaRNN** is an adaptive [RNN](https://paperswithcode.com/methods/category/recurrent-neural-networks) that learns an adaptive model through two modules: [Temporal Distribution Characterization](https://paperswithcode.com/method/temporal-distribution-characterization) (TDC) and [Temporal Distribution Matching](https://paperswithcode.com/method/temporal-distribution-matching) (TDM) algorithms. Firstly, to better characterize the distribution information in time-series, TDC splits the training data into $K$ most diverse periods that have a large distribution gap inspired by the principle of maximum entropy. After that, a temporal distribution matching (TDM) algorithm is used to dynamically reduce distribution divergence using a [RNN](https://paperswithcode.com/methods/category/recurrent-neural-networks)-based model.",
  "title": "AdaRNN: Adaptive Learning and Forecasting of Time Series",
  "collection": "Recurrent Neural Networks",
  "area": "Sequential"
}
{
  "name": "TinyNet",
  "full_name": "Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets",
  "description": "To obtain excellent deep neural architectures, a series of techniques are carefully designed in EfficientNets. The giant formula for simultaneously enlarging the resolution, depth and width provides us a Rubik's cube for neural networks. So that we can find networks with high efficiency and excellent performance by twisting the three dimensions. This paper aims to explore the twisting rules for obtaining deep neural networks with minimum model sizes and computational costs. Different from the network enlarging, we observe that resolution and depth are more important than width for tiny networks. Therefore, the original method, i.e., the compound scaling in [EfficientNet](https://paperswithcode.com/method/efficientnet) is no longer suitable. To this end, we summarize a tiny formula for downsizing neural architectures through a series of smaller models derived from the EfficientNet-B0 with the FLOPs constraint. Experimental results on the ImageNet benchmark illustrate that our TinyNet performs much better than the smaller version of EfficientNets using the inversed giant formula. For instance, our TinyNet-E achieves a 59.9% Top-1 accuracy with only 24M FLOPs, which is about 1.9% higher than that of the previous best [MobileNetV3](https://paperswithcode.com/method/mobilenetv3) with similar computational cost.",
  "title": "Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets",
  "collection": "Network Shrinking",
  "area": "General"
}
{
  "name": "ORN",
  "full_name": "Orientation Regularized Network",
  "description": "**Orientation Regularized Network** (ORN) is a multi-view image fusion technique for pose estimation. It uses IMU orientations as a structural prior to mutually fuse the image features of each pair of joints linked by IMUs. For example, it uses the features of the elbow to reinforce those of the wrist based on the IMU at the lower-arm.",
  "title": "Fusing Wearable IMUs with Multi-View Images for Human Pose Estimation: A Geometric Approach",
  "collection": "Pose Estimation Blocks",
  "area": "Computer Vision"
}
{
  "name": "PointASNL",
  "full_name": "PointASNL",
  "description": "**PointASNL** is a non-local neural network for point clouds processing It consists of two general modules: adaptive sampling (AS) module and local-Nonlocal (L-NL) module. The AS module first re-weights the neighbors around the initial sampled points from farthest point sampling (FPS), and then adaptively adjusts the sampled points beyond the entire point cloud. The AS module can not only benefit the feature learning of point clouds, but also ease the biased effect of outliers. The L-NL module capture the neighbor and long-range dependencies of the sampled point, and enables the learning process to be insensitive to noise.",
  "title": "PointASNL: Robust Point Clouds Processing using Nonlocal Neural Networks with Adaptive Sampling",
  "collection": "Point Cloud Models",
  "area": "Computer Vision"
}
{
  "name": "MobileBERT",
  "full_name": "MobileBERT",
  "description": "**MobileBERT** is a type of inverted-bottleneck [BERT](https://paperswithcode.com/method/bert) that compresses and accelerates the popular BERT model. MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. It is trained by layer-to-layer imitating the inverted bottleneck BERT.",
  "title": "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Panoptic FPN",
  "full_name": "Panoptic FPN",
  "description": "A **Panoptic FPN** is an extension of an [FPN](https://paperswithcode.com/method/fpn) that can generate both instance and semantic segmentations via FPN. The approach starts with an FPN backbone and adds a branch for performing semantic segmentation in parallel with the existing region-based branch for instance segmentation. No changes are made to the FPN backbone when adding the dense-prediction branch, making it compatible with existing instance segmentation methods. \r\n\r\nThe new semantic segmentation branch achieves its goal as follows. Starting from the deepest FPN level (at 1/32 scale), we perform three upsampling stages to yield a feature map at 1/4 scale, where each upsampling stage consists of 3×3 [convolution](https://paperswithcode.com/method/convolution), group norm, [ReLU](https://paperswithcode.com/method/relu), and 2× bilinear upsampling. This strategy is repeated for FPN scales 1/16, 1/8, and 1/4 (with progressively fewer upsampling stages). The result is a set of feature maps at the same 1/4 scale, which are then element-wise summed. A final 1×1 convolution, 4× bilinear upsampling, and [softmax](https://paperswithcode.com/method/softmax) are used to generate the per-pixel class labels at the original image resolution. In addition to stuff classes, this branch also outputs a special ‘other’ class for all pixels belonging to objects (to avoid predicting stuff classes for such pixels).",
  "title": "Panoptic Feature Pyramid Networks",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "HyperTree MetaModel",
  "full_name": "HyperTree MetaModel",
  "description": "Optimize combinations of various neural network models for multimodal data with bayseian optimization.",
  "title": "The CoSTAR Block Stacking Dataset: Learning with Workspace Constraints",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "GPSA",
  "full_name": "Gated Positional Self-Attention",
  "description": "**Gated Positional Self-Attention (GPSA)** is a self-attention module for vision transformers, used in the [ConViT](https://paperswithcode.com/method/convit) architecture, that can be initialized as a convolutional layer -- helping a ViT learn inductive biases about locality.",
  "title": "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "Polyak Averaging",
  "full_name": "Polyak Averaging",
  "description": "**Polyak Averaging** is an optimization technique that sets final parameters to an average of (recent) parameters visited in the optimization trajectory. Specifically if in $t$ iterations we have parameters $\\theta\\_{1}, \\theta\\_{2}, \\dots, \\theta\\_{t}$, then Polyak Averaging suggests setting \r\n\r\n$$ \\theta\\_t =\\frac{1}{t}\\sum\\_{i}\\theta\\_{i} $$\r\n\r\nImage Credit: [Shubhendu Trivedi & Risi Kondor](https://ttic.uchicago.edu/~shubhendu/Pages/Files/Lecture6_flat.pdf)",
  "title": null,
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "TNT",
  "full_name": "Transformer in Transformer",
  "description": "[Transformer](https://paperswithcode.com/method/transformer) is a type of self-attention-based neural networks originally applied for NLP tasks. Recently, pure transformer-based models are proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches while they ignore the intrinsic structure information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model for modeling both patch-level and pixel-level representation. In each TNT block, an outer transformer block is utilized to process patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level feature is projected to the space of patch embedding by a linear transformation layer and then added into the patch. By stacking the TNT blocks, we build the TNT model for image recognition.\r\n\r\nImage source: [Han et al.](https://arxiv.org/pdf/2103.00112v1.pdf)",
  "title": "Transformer in Transformer",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Meena",
  "full_name": "Meena",
  "description": "**Meena** is a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. A seq2seq model is used with the Evolved [Transformer](https://paperswithcode.com/method/transformer) as the main architecture. The model is trained on multi-turn conversations where the input sequence is all turns of the context and the output sequence is the response.",
  "title": "Towards a Human-like Open-Domain Chatbot",
  "collection": "Conversational Models",
  "area": "Natural Language Processing"
}
{
  "name": "Mixture of Logistic Distributions",
  "full_name": "Mixture of Logistic Distributions",
  "description": "**Mixture of Logistic Distributions (MoL)** is a type of output function, and an alternative to a [softmax](https://paperswithcode.com/method/softmax) layer. Discretized logistic mixture likelihood is used in [PixelCNN](https://paperswithcode.com/method/pixelcnn)++ and [WaveNet](https://paperswithcode.com/method/wavenet) to predict discrete values.\r\n\r\nImage Credit: [Hao Gao](https://medium.com/@smallfishbigsea/an-explanation-of-discretized-logistic-mixture-likelihood-bdfe531751f0)",
  "title": null,
  "collection": "Output Functions",
  "area": "General"
}
{
  "name": "Inception-v3",
  "full_name": "Inception-v3",
  "description": "**Inception-v3** is a convolutional neural network architecture from the Inception family that makes several improvements including using [Label Smoothing](https://paperswithcode.com/method/label-smoothing), Factorized 7 x 7 convolutions, and the use of an auxiliary classifer to propagate label information lower down the network (along with the use of [batch normalization](https://paperswithcode.com/method/batch-normalization) for layers in the sidehead).",
  "title": "Rethinking the Inception Architecture for Computer Vision",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "GreedyNAS-B",
  "full_name": "GreedyNAS-B",
  "description": "**GreedyNAS-B** is a convolutional neural network discovered using the [GreedyNAS](https://paperswithcode.com/method/greedynas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building blocks used are inverted residual blocks (from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2)) and squeeze-and-excitation blocks.",
  "title": "GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Switch Transformer",
  "full_name": "Switch Transformer",
  "description": "**Switch Transformer** is a sparsely-activated expert [Transformer](https://paperswithcode.com/methods/category/transformers) model that aims to simplify and improve over Mixture of Experts. Through distillation of sparse pre-trained and specialized fine-tuned models into small dense models, it reduces the model size by up to 99% while preserving 30% of the quality gains of the large sparse teacher. It also uses selective precision training that enables training with lower bfloat16 precision, as well as an initialization scheme that allows for scaling to a larger number of experts, and also increased regularization that improves sparse model fine-tuning and multi-task training.",
  "title": "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "CR-NET",
  "full_name": "CR-NET",
  "description": "CR-NET is a YOLO-based model proposed for license plate character detection and recognition",
  "title": "License Plate Detection and Recognition in Unconstrained Scenarios",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "imagemorph",
  "full_name": "Random elastic image morphing",
  "description": "M. Bulacu, A. Brink, T. v. d. Zant and L. Schomaker, \"Recognition of Handwritten Numerical Fields in a Large Single-Writer Historical Collection,\" 2009 10th International Conference on Document Analysis and Recognition, Barcelona, Spain, 2009, pp. 808-812, doi: 10.1109/ICDAR.2009.8.\r\n\r\nCode: https://github.com/GrHound/imagemorph.c\r\n\r\nIn contrast with the EM algorithm (Baum-Welch) for HMMs, training the basic character recognizer for a segmentation-based handwriting recognition system is a tricky issue without a standard solution. Our approach was to collect a labeled base set of digit images segmented by hand and then to augment this data by generating synthetic examples using random geometric distortions. We were incited by the record performance in digit recognition reported in Simard et al. (2003) but developed our own algorithm for this purpose.\r\nFor every pixel (i,j) of the template image, a random displacement vector ($\\Delta x,\\Delta y$) is generated. The displacement field of the complete image is smoothed using a Gaussian convolution kernel with standard deviation $\\sigma$. The field is finally rescaled to an average amplitude A. The new morphed image (i',j') is generated using the displacement field and bilinear interpolation i'=i+$\\Delta$x,j'=j+$\\Delta$y. This morphing process is controlled by two parameters: the smoothing radius r and the average pixel displacement D. Both parameters are measured in units of pixels.\r\n\r\nAn intuitive interpretation is to imagine that the characters are written on a rubber sheet and we apply non-uniform random local distortions, contracting one part, while maybe expanding another part of the character (see Fig. 5). This random elastic morphing is more general than affine transforms, providing a rich ensemble of shape variations. We applied it to our base set of labeled digits (~130 samples per class) to obtain a much expanded training dataset (from 1 up to 80 times). 
The expansion factor f controls the amount of synthetic data: for every base example, f - 1 additional morphed patterns are generated and used in training.\r\n\r\nThis is a cheap method relying on random numbers and basic computer graphics. In this way, a virtually infinite volume of training samples can be fabricated. This stratagem is very successful and does not increase the load at recognition time for parametric classifiers. Essentially, we tum the tables around and, instead of trying to recognize a character garbled in an unpredictable way by the writer in the instantaneous act of handwriting, we generate the deformations ourselves, while training a neural network to become immune to such distortions.\r\n\r\nThe accompanying image, a crop of an RGB page scan, containing the cursive handwritten word 'Zwolle'\r\nwas morphed a number of times, with parameters dist=1.5, radius=8.5\r\nThis distortion is sufficient to introduce a believable variance in the appearance.\r\n\r\nimagemorph 1.5 8.5 < Zwolle.ppm > Zwolle-morphed.ppm\r\n\r\nNetpbm image format is common in many CV tools. You can use ImageMagick's convert\r\nor other tools to convert to/fro .ppm\r\n\r\n\r\nAlso see: \r\nP. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proc. of 7th ICDAR, pp 958-962, Edinburgh, Scotland, 2003.",
  "title": null,
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "LIME",
  "full_name": "Local Interpretable Model-Agnostic Explanations",
  "description": "**LIME**, or **Local Interpretable Model-Agnostic Explanations**, is an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model. It modifies a single data sample by tweaking the feature values and observes the resulting impact on the output. It performs the role of an \"explainer\" to explain predictions from each data sample. The output of LIME is a set of explanations representing the contribution of each feature to a prediction for a single sample, which is a form of local interpretability.\r\n\r\nInterpretable models in LIME can be, for instance, [linear regression](https://paperswithcode.com/method/linear-regression) or decision trees, which are trained on small perturbations (e.g. adding noise, removing words, hiding parts of the image) of the original model to provide a good local approximation.",
  "title": "\"Why Should I Trust You?\": Explaining the Predictions of Any Classifier",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "Cosine Annealing",
  "full_name": "Cosine Annealing",
  "description": "**Cosine Annealing** is a type of learning rate schedule that has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being increased rapidly again. The resetting of the learning rate acts like a simulated restart of the learning process and the re-use of good weights as the starting point of the restart is referred to as a \"warm restart\" in contrast to a \"cold restart\" where a new set of small random numbers may be used as a starting point.\r\n\r\n$$\\eta\\_{t} = \\eta\\_{min}^{i} + \\frac{1}{2}\\left(\\eta\\_{max}^{i}-\\eta\\_{min}^{i}\\right)\\left(1+\\cos\\left(\\frac{T\\_{cur}}{T\\_{i}}\\pi\\right)\\right)\r\n$$\r\n\r\nWhere where $\\eta\\_{min}^{i}$ and $ \\eta\\_{max}^{i}$ are ranges for the learning rate, and $T\\_{cur}$ account for how many epochs have been performed since the last restart.\r\n\r\nText Source: [Jason Brownlee](https://machinelearningmastery.com/snapshot-ensemble-deep-learning-neural-network/)\r\n\r\nImage Source: [Gao Huang](https://www.researchgate.net/figure/Training-loss-of-100-layer-DenseNet-on-CIFAR10-using-standard-learning-rate-blue-and-M_fig2_315765130)",
  "title": "SGDR: Stochastic Gradient Descent with Warm Restarts",
  "collection": "Learning Rate Schedules",
  "area": "General"
}
{
  "name": "Set Transformer",
  "full_name": "Set Transformer",
  "description": "Many machine learning tasks such as multiple instance learning, 3D shape recognition, and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the order of elements of the set, models used to address them should be permutation invariant. We present an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set. The model consists of an encoder and a decoder, both of which rely on attention mechanisms. In an effort to reduce computational complexity, we introduce an attention scheme inspired by inducing point methods from sparse Gaussian process literature. It reduces the computation time of self-attention from quadratic to linear in the number of elements in the set. We show that our model is theoretically attractive and we evaluate it on a range of tasks, demonstrating the state-of-the-art performance compared to recent methods for set-structured data.",
  "title": "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "CDIL-CNN",
  "full_name": "Circular Dilated Convolutional Neural Networks",
  "description": "",
  "title": "Classification of Long Sequential Data using Circular Dilated Convolutional Neural Networks",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Visual Parsing",
  "full_name": "Visual Parsing",
  "description": "Visual Parsing is a vision and language pretrained model that adopts self-attention for visual feature learning where each visual token is an approximate weighted mixture of all tokens. Thus, visual parsing provides the dependencies of each visual token pair.  It helps better learning of visual relation with the language and promote inter modal alignment. The model is composed of a vision Transformer that takes an image as input and outputs the visual tokens and a multimodal Transformer. \r\nIt applies a linear layer and a Layer Normalization to embed the vision tokens. It follows BERT to get word embeddings. Vision and language tokens are concatenated to form the input sequences. A multi-modal Transformer is used to fuse the vision and language modality. A metric named Inter-Modality Flow (IMF) is used to quantify the interactions between two modalities.\r\nThree pretraining tasks are adopted: Masked Language Modeling (MLM), Image-Text Matching (ITM), and Masked Feature Regression (MFR). MFR is a novel task that is included to mask visual tokens with similar or correlated semantics in this framework.",
  "title": "Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "Batchboost",
  "full_name": "Batchboost",
  "description": "**Batchboost** is a variation on [MixUp](https://paperswithcode.com/method/mixup) that instead of mixing just two images, mixes many images together.",
  "title": "batchboost: regularization for stabilizing training with resistance to underfitting & overfitting",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Spatial Pyramid Pooling",
  "full_name": "Spatial Pyramid Pooling",
  "description": "**Spatial Pyramid Pooling (SPP)** is a pooling layer that removes the fixed-size constraint of the network, i.e. a CNN does not require a fixed-size input image. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). In other words, we perform some information aggregation at a deeper stage of the network hierarchy (between convolutional layers and fully-connected layers) to avoid the need for cropping or warping at the beginning.",
  "title": "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition",
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "Sym-NCO",
  "full_name": "Sym-NCO",
  "description": "",
  "title": "Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization",
  "collection": "Reinforcement Learning Frameworks",
  "area": "Reinforcement Learning"
}
{
  "name": "Latent Optimisation",
  "full_name": "Latent Optimisation",
  "description": "**Latent Optimisation** is a technique used for generative adversarial networks to refine the sample quality of $z$. Specifically, it exploits knowledge from the discriminator $D$ to refine the latent source $z$. Intuitively, the gradient $\\nabla\\_{z}f\\left(z\\right) = \\frac{\\partial{f}\\left(z\\right)}{\\partial{z}}$ points in the direction that better satisfies the discriminator $D$, which implies better samples. Therefore, instead of using the randomly sampled $z \\sim p\\left(z\\right)$, we use the optimised latent:\r\n\r\n$$ \\Delta{z} = \\alpha\\frac{\\partial{f}\\left(z\\right)}{\\partial{z}} $$\r\n\r\n$$ z' = z + \\Delta{z} $$\r\n\r\nSource: [LOGAN](https://paperswithcode.com/method/logan)",
  "title": "Deep Compressed Sensing",
  "collection": "Latent Variable Sampling",
  "area": "General"
}
{
  "name": "Focus",
  "full_name": "Focus",
  "description": "",
  "title": "Focus Your Attention (with Adaptive IIR Filters)",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "k-NN",
  "full_name": "k-Nearest Neighbors",
  "description": "**$k$-Nearest Neighbors** is a non-parametric supervised learning algorithm for classification and regression. It is a type of instance-based learning, as it does not attempt to construct a general internal model, but simply stores instances of the training data. In classification, the prediction is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.\r\n\r\nSource of Description and Image: [scikit-learn](https://scikit-learn.org/stable/modules/neighbors.html#classification)",
  "title": null,
  "collection": "Non-Parametric Classification",
  "area": "General"
}
{
  "name": "iGCL",
  "full_name": "Implicit Graph Contrastive Learning",
  "description": "",
  "title": "Graph Contrastive Learning with Implicit Augmentations",
  "collection": "Graph Representation Learning",
  "area": "Graphs"
}
{
  "name": "WaveGrad UBlock",
  "full_name": "WaveGrad UBlock",
  "description": "The **WaveGrad UBlock** is used for upsampling in [WaveGrad](https://paperswithcode.com/method/wavegrad). Neural audio generation models often use a large receptive field. Dilation factors of four convolutional layers are 1, 2, 1, 2 for the first two UBlocks and 1, 2, 4, 8 for the rest. Orthogonal initialization is used.",
  "title": "WaveGrad: Estimating Gradients for Waveform Generation",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "Batch Normalization",
  "full_name": "Batch Normalization",
  "description": "**Batch Normalization** aims to reduce internal covariate shift, and in doing so aims to accelerate the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows for use of much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for [Dropout](https://paperswithcode.com/method/dropout).\r\n\r\nWe apply a batch normalization layer as follows for a minibatch $\\mathcal{B}$:\r\n\r\n$$ \\mu\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}x\\_{i} $$\r\n\r\n$$ \\sigma^{2}\\_{\\mathcal{B}} = \\frac{1}{m}\\sum^{m}\\_{i=1}\\left(x\\_{i}-\\mu\\_{\\mathcal{B}}\\right)^{2} $$\r\n\r\n$$ \\hat{x}\\_{i} = \\frac{x\\_{i} - \\mu\\_{\\mathcal{B}}}{\\sqrt{\\sigma^{2}\\_{\\mathcal{B}}+\\epsilon}} $$\r\n\r\n$$ y\\_{i} = \\gamma\\hat{x}\\_{i} + \\beta = \\text{BN}\\_{\\gamma, \\beta}\\left(x\\_{i}\\right) $$\r\n\r\nWhere $\\gamma$ and $\\beta$ are learnable parameters.",
  "title": "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "GLM",
  "full_name": "GLM",
  "description": "**GLM** is a bilingual (English and Chinese) pre-trained transformer-based language model that follows the traditional architecture of decoder-only autoregressive language modeling. It leverages autoregressive blank infilling as its training objective.",
  "title": "GLM-130B: An Open Bilingual Pre-trained Model",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "SSFG regularization",
  "full_name": "Stochastically Scaling Features and Gradients Regularization",
  "description": "",
  "title": "SSFG: Stochastically Scaling Features and Gradients for Regularizing Graph Convolutional Networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "EmbraceNet",
  "full_name": "EmbraceNet: A robust deep learning architecture for multimodal classification",
  "description": "",
  "title": "EmbraceNet: A robust deep learning architecture for multimodal classification",
  "collection": "Multi-Modal Methods",
  "area": "Computer Vision"
}
{
  "name": "VQ-VAE-2",
  "full_name": "VQ-VAE-2",
  "description": "**VQ-VAE-2** is a type of variational autoencoder that combines a two-level hierarchical VQ-[VAE](https://paperswithcode.com/method/vae) with a self-attention autoregressive model ([PixelCNN](https://paperswithcode.com/method/pixelcnn)) as a prior. The encoder and decoder architectures are kept simple and light-weight as in the original [VQ-VAE](https://paperswithcode.com/method/vq-vae), with the only difference that hierarchical multi-scale latent maps are used for increased resolution.",
  "title": "Generating Diverse High-Fidelity Images with VQ-VAE-2",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "ConViT",
  "full_name": "ConViT",
  "description": "**ConViT** is a type of [vision transformer](https://paperswithcode.com/method/vision-transformer) that uses a gated positional self-attention module ([GPSA](https://paperswithcode.com/method/gpsa)), a form of positional self-attention which can be equipped with a “soft” convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers, then each attention head is given the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information.",
  "title": "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "MonoPort",
  "full_name": "Monocular Real-Time Volumetric Performance Capture",
  "description": "",
  "title": "Monocular Real-Time Volumetric Performance Capture",
  "collection": "3D Reconstruction",
  "area": "Computer Vision"
}
{
  "name": "DeLighT",
  "full_name": "DeLighT",
  "description": "**DeLighT** is a [transformer](https://paperswithcode.com/method/transformer) architecture that delivers parameter efficiency improvements by (1) within each Transformer block using [DExTra](https://paperswithcode.com/method/dextra), a deep and light-weight transformation, allowing for the use of [single-headed attention](https://paperswithcode.com/method/single-headed-attention) and bottleneck FFN layers and (2) across blocks using block-wise scaling, that allows for shallower and narrower [DeLighT blocks](https://paperswithcode.com/method/delight-block) near the input and wider and deeper DeLighT blocks near the output.",
  "title": "DeLighT: Deep and Light-weight Transformer",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "SIG",
  "full_name": "Sliced Iterative Generator",
  "description": "The **Sliced Iterative Generator (SIG)** is an iterative generative model that is a Normalizing Flow (NF), but shares the advantages of Generative Adversarial Networks (GANs). The model is based on iterative Optimal Transport of a series of 1D slices through the data space, matching on each slice the probability distribution function (PDF) of the samples to the data. To improve the efficiency, the directions of the orthogonal slices are chosen to maximize the PDF difference between the generated samples and the data using Wasserstein distance at each iteration. A patch based approach is adopted to model the images in a hierarchical way, enabling the model to scale well to high dimensions. \r\n\r\nUnlike GANs, SIG has a NF structure and allows efficient likelihood evaluations that can be used in downstream tasks. While SIG has a deep neural network architecture, the approach deviates significantly from the current deep learning paradigm, as it does not use concepts such as mini-batching, stochastic gradient descent, gradient back-propagation through deep layers, or non-convex loss function optimization. SIG is very insensitive to hyper-parameter tuning, making it a useful generator tool for ML experts and non-experts alike.",
  "title": "Sliced Iterative Normalizing Flows",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Dense Contrastive Learning",
  "full_name": "Dense Contrastive Learning",
  "description": "**Dense Contrastive Learning** is a self-supervised learning method for dense prediction tasks. It implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. In contrast to the regular contrastive loss, which is computed between the single feature vectors output by the global projection head at the level of global features, the dense contrastive loss is computed between the dense feature vectors output by the dense projection head at the level of local features.",
  "title": "Dense Contrastive Learning for Self-Supervised Visual Pre-Training",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "DG-Net",
  "full_name": "Discriminative and Generative Network",
  "description": "",
  "title": "Joint Discriminative and Generative Learning for Person Re-identification",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Inception-v3 Module",
  "full_name": "Inception-v3 Module",
  "description": "**Inception-v3 Module** is an image block used in the [Inception-v3](https://paperswithcode.com/method/inception-v3) architecture. This architecture is used on the coarsest (8 × 8) grids to promote high dimensional representations.",
  "title": "Rethinking the Inception Architecture for Computer Vision",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "WRQE",
  "full_name": "Weighted Recurrent Quality Enhancement",
  "description": "**Weighted Recurrent Quality Enhancement**, or **WRQE**, is a recurrent quality enhancement network for video compression that takes both compressed frames and the bit stream as inputs. In the recurrent cell of WRQE, the memory and update signal are weighted by quality features to reasonably leverage multi-frame information for enhancement.",
  "title": "Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement",
  "collection": "Video Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "FractalNet",
  "full_name": "FractalNet",
  "description": "**FractalNet** is a type of convolutional neural network that eschews [residual connections](https://paperswithcode.com/method/residual-connection) in favour of a \"fractal\" design. They involve repeated application of a simple expansion rule to generate deep networks whose structural layouts are precisely truncated fractals. These networks contain interacting subpaths of different lengths, but do not include any pass-through or residual connections; every internal signal is transformed by a filter and nonlinearity before being seen by subsequent layers.",
  "title": "FractalNet: Ultra-Deep Neural Networks without Residuals",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "CRISS",
  "full_name": "CRISS",
  "description": "**CRISS**, or **Cross-lingual Retrieval for Iterative Self-Supervised Training**, is a self-supervised learning method for multilingual sequence generation. CRISS is developed based on the finding that the encoder outputs of a multilingual denoising autoencoder can be used as language-agnostic representations to retrieve parallel sentence pairs, and that training the model on these retrieved sentence pairs can further improve its sentence retrieval and translation capabilities in an iterative manner. Using only unlabeled data from many different languages, CRISS iteratively mines for parallel sentences across languages, trains a new better multilingual model using these mined sentence pairs, mines again for better parallel sentences, and repeats.",
  "title": "Cross-lingual Retrieval for Iterative Self-Supervised Training",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "PAA",
  "full_name": "Patch AutoAugment",
  "description": "**Patch AutoAugment** is a patch-level automatic data augmentation algorithm that automatically searches for the optimal augmentation policies for the patches of an image. Specifically, PAA allows each patch DA operation to be controlled by an agent and models it as a Multi-Agent Reinforcement Learning (MARL) problem. At each step, PAA samples the most effective operation for each patch based on its content and the semantics of the whole image. The agents cooperate as a team and share a unified team reward for achieving the joint optimal DA policy of the whole image. PAA is co-trained with a target network through adversarial training.",
  "title": "Local Patch AutoAugment with Multi-Agent Collaboration",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Diffusion",
  "full_name": "Diffusion",
  "description": "Diffusion models generate samples by gradually removing noise from a signal, and their training objective can be expressed as a reweighted variational lower-bound (https://arxiv.org/abs/2006.11239).",
  "title": "Denoising Diffusion Probabilistic Models",
  "collection": "Image Generation Models",
  "area": "Computer Vision"
}
{
  "name": "T2T-ViT",
  "full_name": "Tokens-To-Token Vision Transformer",
  "description": "**T2T-ViT** (Tokens-To-Token Vision Transformer) is a type of [Vision Transformer](https://paperswithcode.com/method/vision-transformer) which incorporates 1) a layerwise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and token length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision [transformer](https://paperswithcode.com/method/transformer) motivated by CNN architecture design after empirical study.",
  "title": "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "CReLU",
  "full_name": "CReLU",
  "description": "**CReLU**, or **Concatenated Rectified Linear Units**, is a type of activation function which preserves both positive and negative phase information while enforcing non-saturated non-linearity. We compute by concatenating the layer output $h$ as:\r\n\r\n$$ \\left[\\text{ReLU}\\left(h\\right), \\text{ReLU}\\left(-h\\right)\\right] $$",
  "title": "Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "Weight Normalization",
  "full_name": "Weight Normalization",
  "description": "**Weight Normalization** is a normalization method for training neural networks. It is inspired by [batch normalization](https://paperswithcode.com/method/batch-normalization), but it is a deterministic method that does not share batch normalization's property of adding noise to the gradients. It reparameterizes each $k$-dimensional weight vector $\\textbf{w}$ in terms of a parameter vector $\\textbf{v}$ and a scalar parameter $g$, and performs stochastic gradient descent with respect to those parameters instead. Weight vectors are expressed in terms of the new parameters using:\r\n\r\n$$ \\textbf{w} = \\frac{g}{\\Vert\\textbf{v}\\Vert}\\textbf{v} $$\r\n\r\nwhere $\\textbf{v}$ is a $k$-dimensional vector, $g$ is a scalar, and $\\Vert\\textbf{v}\\Vert$ denotes the Euclidean norm of $\\textbf{v}$. This reparameterization has the effect of fixing the Euclidean norm of the weight vector $\\textbf{w}$: we now have $\\Vert\\textbf{w}\\Vert = g$, independent of the parameters $\\textbf{v}$.",
  "title": "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Attentional Liquid Warping GAN",
  "full_name": "Attentional Liquid Warping GAN",
  "description": "**Attentional Liquid Warping GAN** is a type of generative adversarial network for human image synthesis that utilizes an [AttLWB](https://paperswithcode.com/method/attlwb) block, which is a 3D body mesh recovery module that disentangles pose and shape. To preserve the source information, such as texture, style, color, and face identity, the Attentional Liquid Warping GAN with AttLWB propagates the source information in both image and feature spaces to the synthesized reference.",
  "title": "Liquid Warping GAN with Attention: A Unified Framework for Human Image Synthesis",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "ResMLP",
  "full_name": "Residual Multi-Layer Perceptrons",
  "description": "**Residual Multi-Layer Perceptrons**, or **ResMLP**, is an architecture built entirely upon [multi-layer perceptrons](https://paperswithcode.com/methods/category/feedforward-networks) for image classification. It is a simple [residual network](https://paperswithcode.com/method/residual-connection) that alternates (i) a [linear layer](https://paperswithcode.com/method/linear-layer) in which image patches interact, independently and identically across channels, and (ii) a two-layer [feed-forward network](https://paperswithcode.com/method/feedforward-network) in which channels interact independently per patch. At the end of the network, the patch representations are average pooled, and fed to a linear classifier.\r\n\r\n[Layer normalization](https://paperswithcode.com/method/layer-normalization) is replaced with a simpler [affine transformation](https://paperswithcode.com/method/affine-operator), thanks to the absence of self-attention layers which makes training more stable. The affine operator is applied at the beginning (\"pre-normalization\") and end (\"post-normalization\") of each residual block. As a pre-normalization, Aff replaces LayerNorm without using channel-wise statistics. Initialization is achieved as $\\mathbf{\\alpha}=\\mathbf{1}$, and $\\mathbf{\\beta}=\\mathbf{0}$. As a post-normalization, Aff is similar to [LayerScale](https://paperswithcode.com/method/layerscale) and $\\mathbf{\\alpha}$ is initialized with the same small value.",
  "title": "ResMLP: Feedforward networks for image classification with data-efficient training",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "Internet Explorer",
  "full_name": "Internet Explorer",
  "description": "Internet Explorer explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next.",
  "title": "Internet Explorer: Targeted Representation Learning on the Open Web",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "BIMAN",
  "full_name": "BIMAN",
  "description": "**BIMAN**, or **Bot Identification by commit Message, commit Association, and author Name**, is a technique to detect bots that commit code. It is comprised of three methods that consider independent aspects of the commits made by a particular author: 1) Commit Message: Identify if commit messages are being generated from templates; 2) Commit Association: Predict if an author is a bot using a random forest model, with features related to files and projects associated with the commits as predictors; and 3) Author Name: Match author’s name and email to common bot patterns.",
  "title": "Detecting and Characterizing Bots that Commit Code",
  "collection": "Bot Detection",
  "area": "General"
}
{
  "name": "nnFormer",
  "full_name": "nnFormer",
  "description": "**nnFormer**, or **not-another transFormer**, is a semantic segmentation model with an interleaved architecture based on an empirical combination of self-attention and [convolution](https://paperswithcode.com/method/convolution). Firstly, a light-weight convolutional embedding layer is used ahead of the [transformer](https://paperswithcode.com/method/transformer) blocks. In comparison to directly flattening raw pixels and applying 1D pre-processing, the convolutional embedding layer encodes precise (i.e., pixel-level) spatial information and provides low-level yet high-resolution 3D features. After the embedding block, transformer and convolutional down-sampling blocks are interleaved to fully entangle long-term dependencies with high-level and hierarchical object concepts at various scales, which helps improve the generalization ability and robustness of learned representations.",
  "title": "nnFormer: Interleaved Transformer for Volumetric Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Locally-Grouped Self-Attention",
  "full_name": "Locally-Grouped Self-Attention",
  "description": "**Locally-Grouped Self-Attention**, or **LSA**, is a local attention mechanism used in the [Twins-SVT](https://paperswithcode.com/method/twins-svt) architecture. Motivated by the group design in depthwise convolutions for efficient inference, we first equally divide the 2D feature maps into sub-windows, making self-attention communications only happen within each sub-window. This design also resonates with the multi-head design in self-attention, where the communications only occur within the channels of the same head. To be specific, the feature maps are divided into $m \\times n$ sub-windows. Without loss of generality, we assume $H \\% m=0$ and $W \\% n=0$. Each group contains $\\frac{H W}{m n}$ elements, and thus the computation cost of the self-attention in this window is $\\mathcal{O}\\left(\\frac{H^{2} W^{2}}{m^{2} n^{2}} d\\right)$, and the total cost is $\\mathcal{O}\\left(\\frac{H^{2} W^{2}}{m n} d\\right)$. If we let $k\\_{1}=\\frac{H}{m}$ and $k\\_{2}=\\frac{W}{n}$, the cost can be computed as $\\mathcal{O}\\left(k\\_{1} k\\_{2} H W d\\right)$, which is significantly more efficient when $k\\_{1} \\ll H$ and $k\\_{2} \\ll W$ and grows linearly with $H W$ if $k\\_{1}$ and $k\\_{2}$ are fixed.\r\n\r\nAlthough the locally-grouped self-attention mechanism is computation friendly, the image is divided into non-overlapping sub-windows. Thus, we need a mechanism to communicate between different sub-windows, as in Swin. Otherwise, the information would be limited to be processed locally, which makes the receptive field small and significantly degrades the performance as shown in our experiments. This resembles the fact that we cannot replace all standard convolutions by depth-wise convolutions in CNNs.",
  "title": "Twins: Revisiting the Design of Spatial Attention in Vision Transformers",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "PAFPN",
  "full_name": "PAFPN",
  "description": "**PAFPN** is a feature pyramid module used in Path Aggregation networks ([PANet](https://paperswithcode.com/method/panet)) that combines FPNs with [bottom-up path augmentation](https://paperswithcode.com/method/bottom-up-path-augmentation), which shortens the information path between lower layers and topmost feature.",
  "title": "Path Aggregation Network for Instance Segmentation",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "Dynamic R-CNN",
  "full_name": "Dynamic R-CNN",
  "description": "**Dynamic R-CNN** is an object detection method that adjusts the label assignment criteria (IoU threshold) and the shape of regression loss function (parameters of Smooth L1 Loss) automatically based on the statistics of proposals during training. The motivation is that in previous two-stage object detectors, there is an inconsistency problem between the fixed network settings and the dynamic training procedure. For example, the fixed label assignment strategy and regression loss function cannot fit the distribution change of proposals and thus are harmful to training high quality detectors.\r\n\r\nIt consists of two components: Dynamic Label Assignment and Dynamic Smooth L1 Loss, which are designed for the classification and regression branches, respectively. \r\n\r\nFor Dynamic Label Assignment, we want our model to be discriminative for high IoU proposals, so we gradually adjust the IoU threshold for positive/negative samples based on the proposals distribution in the training procedure. Specifically, we set the threshold as the IoU of the proposal at a certain percentage since it can reflect the quality of the overall distribution. \r\n\r\nFor Dynamic Smooth L1 Loss, we want to change the shape of the regression loss function to adaptively fit the distribution change of error and ensure the contribution of high quality samples to training. This is achieved by adjusting the $\\beta$ in Smooth L1 Loss based on the error distribution of the regression loss function, in which $\\beta$ actually controls the magnitude of the gradient of small errors.",
  "title": "Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "CayleyNet",
  "full_name": "CayleyNet",
  "description": "The core ingredient of **CayleyNet** is a new class of parametric rational complex functions (Cayley polynomials) allowing to efficiently compute spectral filters on graphs that specialize on frequency bands of interest. The model generates rich spectral filters that are localized in space, scales linearly with the size of the input data for sparsely-connected graphs, and can handle different constructions of Laplacian operators.\r\n\r\nDescription adapted from: [CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters](https://arxiv.org/pdf/1705.07664.pdf)",
  "title": "CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "FFB6D",
  "full_name": "FFB6D",
  "description": "**FFB6D** is a full flow bidirectional fusion network for 6D pose estimation of known objects from a single RGBD image. Unlike previous works that extract the RGB and point cloud features independently and fuse them in the final stage, FFB6D builds bidirectional fusion modules as communication bridges in the full flow of the two networks. In this way, the two networks can obtain complementary information from the other and learn representations containing rich appearance and geometry information of the scene.",
  "title": "FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation",
  "collection": "6D Pose Estimation Models",
  "area": "Computer Vision"
}
{
  "name": "HRank",
  "full_name": "HRank",
  "description": "**HRank** is a filter pruning method that explores the High Rank of the feature map in each layer (HRank). The proposed HRank  is inspired by the discovery that the average rank of multiple feature maps generated by a single filter is always the same, regardless of the number of image batches CNNs receive. Based on HRank, the authors develop a method that is mathematically formulated to prune filters with low-rank feature maps.",
  "title": "HRank: Filter Pruning using High-Rank Feature Map",
  "collection": "Pruning",
  "area": "General"
}
{
  "name": "SynaNN",
  "full_name": "Synaptic Neural Network",
  "description": "A Synaptic Neural Network (SynaNN) consists of synapses and neurons. Inspired by the synapse research of neuroscience, we built a synapse model with a nonlinear and log-concave synapse function of excitatory and inhibitory probabilities of channels.",
  "title": "A Synaptic Neural Network and Synapse Learning",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "MDL",
  "full_name": "Minimum Description Length",
  "description": "**Minimum Description Length** provides a criterion for the selection of models, regardless of their complexity, without the restrictive assumption that the data form a sample from a 'true' distribution.\r\n\r\nExtracted from [scholarpedia](http://scholarpedia.org/article/Minimum_description_length)\r\n\r\n**Source**:\r\n\r\nPaper: [J. Rissanen (1978) Modeling by the shortest data description. Automatica 14, 465-471](https://doi.org/10.1016/0005-1098(78)90005-5)\r\n\r\nBook: [P. D. Grünwald (2007) The Minimum Description Length Principle, MIT Press, June 2007, 570 pages](https://ieeexplore.ieee.org/servlet/opac?bknumber=6267274)",
  "title": null,
  "collection": "AutoML",
  "area": "General"
}
{
  "name": "Random Mutation Search",
  "full_name": "Random Mutation Search",
  "description": "",
  "title": "Self-Constructing Neural Networks Through Random Mutation",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "Gaussian Process",
  "full_name": "Gaussian Process",
  "description": "**Gaussian Processes** are non-parametric models for approximating functions. They rely upon a measure of similarity between points (the kernel function) to predict the value for an unseen point from training data. The models are fully probabilistic so uncertainty bounds are baked in with the model.\r\n\r\nImage Source: Gaussian Processes for Machine Learning, C. E. Rasmussen & C. K. I. Williams",
  "title": null,
  "collection": "Non-Parametric Classification",
  "area": "General"
}
{
  "name": "DAMO-YOLO",
  "full_name": "DAMO-YOLO",
  "description": "",
  "title": "DAMO-YOLO : A Report on Real-Time Object Detection Design",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "DINO",
  "full_name": "self-DIstillation with NO labels",
  "description": "**DINO** (self-distillation with no labels) is a self-supervised learning method that directly predicts the output of a teacher network - built with a momentum encoder - using a standard cross-entropy loss. \r\n\r\nIn the example to the right, DINO is illustrated in the case of one single pair of views $\\left(x\\_{1}, x\\_{2}\\right)$ for simplicity.\r\nThe model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters.\r\nThe output of the teacher network is centered with a mean computed over the batch. Each network outputs a $K$ dimensional feature normalized with a temperature [softmax](https://paperswithcode.com/method/softmax) over the feature dimension.\r\nTheir similarity is then measured with a cross-entropy loss.\r\nA stop-gradient (sg) operator is applied to the teacher to propagate gradients only through the student.\r\nThe teacher parameters are updated with the student parameters' exponential moving average (ema).",
  "title": "Emerging Properties in Self-Supervised Vision Transformers",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "SAGAN Self-Attention Module",
  "full_name": "SAGAN Self-Attention Module",
  "description": "The **SAGAN Self-Attention Module** is a self-attention module used in the [Self-Attention GAN](https://paperswithcode.com/method/sagan) architecture for image synthesis. In the module, image features from the previous hidden layer $\\textbf{x} \\in \\mathbb{R}^{C\\text{x}N}$ are first transformed into two feature spaces $\\textbf{f}$, $\\textbf{g}$ to calculate the attention, where $\\textbf{f(x) = W}\\_{\\textbf{f}}{\\textbf{x}}$, $\\textbf{g}(\\textbf{x})=\\textbf{W}\\_{\\textbf{g}}\\textbf{x}$. We then calculate:\r\n\r\n$$\\beta_{j, i} = \\frac{\\exp\\left(s_{ij}\\right)}{\\sum^{N}\\_{i=1}\\exp\\left(s_{ij}\\right)} $$\r\n\r\n$$ \\text{where } s_{ij} = \\textbf{f}(\\textbf{x}\\_{i})^{T}\\textbf{g}(\\textbf{x}\\_{j}) $$\r\n\r\nand $\\beta_{j, i}$ indicates the extent to which the model attends to the $i$th location when synthesizing the $j$th region. Here, $C$ is the number of channels and $N$ is the number of feature locations of features from the previous hidden layer. 
The output of the attention layer is $\\textbf{o} = \\left(\\textbf{o}\\_{\\textbf{1}}, \\textbf{o}\\_{\\textbf{2}}, \\ldots, \\textbf{o}\\_{\\textbf{j}} , \\ldots, \\textbf{o}\\_{\\textbf{N}}\\right) \\in \\mathbb{R}^{C\\text{x}N}$ , where,\r\n\r\n$$ \\textbf{o}\\_{\\textbf{j}} = \\textbf{v}\\left(\\sum^{N}\\_{i=1}\\beta_{j, i}\\textbf{h}\\left(\\textbf{x}\\_{\\textbf{i}}\\right)\\right) $$\r\n\r\n$$ \\textbf{h}\\left(\\textbf{x}\\_{\\textbf{i}}\\right) = \\textbf{W}\\_{\\textbf{h}}\\textbf{x}\\_{\\textbf{i}} $$\r\n\r\n$$ \\textbf{v}\\left(\\textbf{x}\\_{\\textbf{i}}\\right) = \\textbf{W}\\_{\\textbf{v}}\\textbf{x}\\_{\\textbf{i}} $$\r\n\r\nIn the above formulation, $\\textbf{W}\\_{\\textbf{g}} \\in \\mathbb{R}^{\\bar{C}\\text{x}C}$, $\\mathbf{W}\\_{f} \\in \\mathbb{R}^{\\bar{C}\\text{x}C}$, $\\textbf{W}\\_{\\textbf{h}} \\in \\mathbb{R}^{\\bar{C}\\text{x}C}$ and $\\textbf{W}\\_{\\textbf{v}} \\in \\mathbb{R}^{C\\text{x}\\bar{C}}$ are the learned weight matrices, which are implemented as $1$×$1$ convolutions. The authors choose  $\\bar{C} = C/8$.\r\n\r\nIn addition, the module further multiplies the output of the attention layer by a scale parameter and adds back the input feature map. Therefore, the final output is given by,\r\n\r\n$$\\textbf{y}\\_{\\textbf{i}} = \\gamma\\textbf{o}\\_{\\textbf{i}} + \\textbf{x}\\_{\\textbf{i}}$$\r\n\r\nwhere $\\gamma$ is a learnable scalar and it is initialized as 0. Introducing $\\gamma$ allows the network to first rely on the cues in the local neighborhood – since this is easier – and then gradually learn to assign more weight to the non-local evidence.",
  "title": "Self-Attention Generative Adversarial Networks",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "SCARF",
  "full_name": "SCARF",
  "description": "SCARF is a simple, widely-applicable technique for contrastive learning, where views are formed by corrupting a random subset of features. When applied to pre-train deep neural networks on the 69 real-world, tabular classification datasets from the OpenML-CC18 benchmark, SCARF not only improves classification accuracy in the fully-supervised setting but does so also in the presence of label noise and in the semi-supervised setting where only a fraction of the available training data is labeled.",
  "title": "SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "Gather-Excite Networks",
  "full_name": "Gather-Excite Networks",
  "description": "GENet combines part gathering and excitation operations. In the first step, it aggregates input features over large neighborhoods and models the relationship between different spatial locations. In the second step, it first generates an attention map of the same size as the input feature map, using interpolation. Then each position in the input feature map is scaled by multiplying by the corresponding element in the attention map.",
  "title": "Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Deformable Attention Module",
  "full_name": "Deformable Attention Module",
  "description": "**Deformable Attention Module** is an attention module used in the [Deformable DETR](https://paperswithcode.com/method/deformable-detr) architecture, which seeks to overcome one issue with base [Transformer attention](https://paperswithcode.com/method/scaled): that it looks over all possible spatial locations. Inspired by [deformable convolution](https://paperswithcode.com/method/deformable-convolution), the deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. By assigning only a small fixed number of keys for each query, the issues of convergence and feature spatial resolution can be mitigated.\r\n\r\nGiven an input feature map $x \\in \\mathbb{R}^{C \\times H \\times W}$, let $q$ index a query element with content feature $\\mathbf{z}\\_{q}$ and a 2-d reference point $\\mathbf{p}\\_{q}$, the deformable attention feature is calculated by:\r\n\r\n$$ \\text{DeformAttn}\\left(\\mathbf{z}\\_{q}, \\mathbf{p}\\_{q}, \\mathbf{x}\\right)=\\sum\\_{m=1}^{M} \\mathbf{W}\\_{m}\\left[\\sum\\_{k=1}^{K} A\\_{m q k} \\cdot \\mathbf{W}\\_{m}^{\\prime} \\mathbf{x}\\left(\\mathbf{p}\\_{q}+\\Delta \\mathbf{p}\\_{m q k}\\right)\\right]\r\n$$\r\n\r\nwhere $m$ indexes the attention head, $k$ indexes the sampled keys, and $K$ is the total sampled key number ($K \\ll H W$). $\\Delta \\mathbf{p}_{m q k}$ and $A_{m q k}$ denote the sampling offset and attention weight of the $k^{\\text {th }}$ sampling point in the $m^{\\text {th }}$ attention head, respectively. The scalar attention weight $A_{m q k}$ lies in the range $[0,1]$, normalized by $\\sum_{k=1}^{K} A_{m q k}=1$. The offsets $\\Delta \\mathbf{p}_{m q k} \\in \\mathbb{R}^{2}$ are 2-d real numbers with unconstrained range. As $\\mathbf{p}\\_{q}+\\Delta \\mathbf{p}\\_{m q k}$ is fractional, bilinear interpolation is applied as in Dai et al. (2017) in computing $\\mathbf{x}\\left(\\mathbf{p}\\_{q}+\\Delta \\mathbf{p}\\_{m q k}\\right)$. 
Both $\\Delta \\mathbf{p}\\_{m q k}$ and $A\\_{m q k}$ are obtained via linear projection over the query feature $z\\_{q} .$ In implementation, the query feature $z\\_{q}$ is fed to a linear projection operator of $3 M K$ channels, where the first $2 M K$ channels encode the sampling offsets $\\Delta p\\_{m q k}$, and the remaining $M K$ channels are fed to a softmax operator to obtain the attention weights $A\\_{m q k}$.",
  "title": "Deformable DETR: Deformable Transformers for End-to-End Object Detection",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "MoGA-C",
  "full_name": "MoGA-C",
  "description": "**MoGA-C** is a convolutional neural network optimized for mobile latency and discovered via Mobile GPU-Aware (MoGA) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search). The basic building blocks are MBConvs (inverted residual blocks) from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2). Squeeze-and-excitation layers are also experimented with.",
  "title": "MoGA: Searching Beyond MobileNetV3",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Sparse Transformer",
  "full_name": "Sparse Transformer",
  "description": "A **Sparse Transformer** is a [Transformer](https://paperswithcode.com/method/transformer) based architecture which utilises sparse factorizations of the attention matrix to reduce time/memory to $O(n \\sqrt{n})$. Other changes to the Transformer architecture include: (a) a restructured [residual block](https://paperswithcode.com/method/residual-block) and weight initialization, (b) a set of sparse attention kernels which efficiently compute subsets of the attention matrix, (c) recomputation of attention weights during the backwards pass to reduce memory usage.",
  "title": "Generating Long Sequences with Sparse Transformers",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Gaussian Affinity",
  "full_name": "Gaussian Affinity",
  "description": "**Gaussian Affinity** is a type of affinity or self-similarity function between two points $\\mathbb{x\\_{i}}$ and $\\mathbb{x\\_{j}}$ that uses a Gaussian function:\r\n\r\n$$ f\\left(\\mathbb{x\\_{i}}, \\mathbb{x\\_{j}}\\right) = e^{\\mathbb{x^{T}\\_{i}}\\mathbb{x\\_{j}}} $$\r\n\r\nHere $\\mathbb{x^{T}\\_{i}}\\mathbb{x\\_{j}}$ is dot-product similarity.",
  "title": null,
  "collection": "Affinity Functions",
  "area": "General"
}
{
  "name": "RealNVP",
  "full_name": "RealNVP",
  "description": "**RealNVP** is a generative model that utilises real-valued non-volume preserving (real NVP) transformations for density estimation. The model can perform efficient and exact inference, sampling and log-density estimation of data points.",
  "title": "Density estimation using Real NVP",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "Canvas Method",
  "full_name": "Canvas Method",
  "description": "**Canvas Method** is a method for inference attacks on object detection models. It draws a predicted bounding box distribution on an empty canvas for an attack model input. The canvas is initially set to an image of 300$\\times$300 pixels in size, where every pixel has a value of zero and the boxes drawn on the canvas have the same center as the predicted boxes and the same intensity as the prediction scores.",
  "title": "Membership Inference Attacks Against Object Detection Models",
  "collection": "Inference Attack",
  "area": "General"
}
{
  "name": "Deflation",
  "full_name": "Deflation",
  "description": "**Deflation** is a video-to-image operation to transform a video network into a network that can ingest a single image. In the two types of video networks considered in the original paper, this deflation corresponds to the following operations: for [3D convolutional based networks](https://paperswithcode.com/method/3d-convolution), summing the 3D spatio-temporal filters over the temporal dimension to obtain 2D filters; for TSM networks, turning off the channel shifting, which results in a standard [residual architecture](https://paperswithcode.com/method/resnet) (ResNet50) for images.",
  "title": "Self-Supervised MultiModal Versatile Networks",
  "collection": "Miscellaneous Components",
  "area": "General"
}
{
  "name": "FORK",
  "full_name": "Forward-Looking Actor",
  "description": "**FORK**, or **Forward Looking Actor**, is a type of actor for actor-critic algorithms. In particular, FORK includes a neural network that forecasts the next state given the current state and current action, called the system network; and a neural network that forecasts the reward given a (state, action) pair, called the reward network. With the system network and reward network, FORK can forecast the next state and consider the value of the next state when improving the policy.",
  "title": "FORK: A Forward-Looking Actor For Model-Free Reinforcement Learning",
  "collection": "Actor-Critic Algorithms",
  "area": "Reinforcement Learning"
}
{
  "name": "Kaiming Initialization",
  "full_name": "Kaiming Initialization",
  "description": "**Kaiming Initialization**, or **He Initialization**, is an initialization method for neural networks that takes into account the non-linearity of activation functions, such as [ReLU](https://paperswithcode.com/method/relu) activations.\r\n\r\nA proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially. The authors derive the following condition to prevent this:\r\n\r\n$$\\frac{1}{2}n\\_{l}\\text{Var}\\left[w\\_{l}\\right] = 1 $$\r\n\r\nThis implies an initialization scheme of:\r\n\r\n$$ w\\_{l} \\sim \\mathcal{N}\\left(0,  2/n\\_{l}\\right)$$\r\n\r\nThat is, a zero-centered Gaussian with standard deviation of $\\sqrt{2/{n}\\_{l}}$ (variance shown in equation above). Biases are initialized at $0$.",
  "title": "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification",
  "collection": "Initialization",
  "area": "General"
}
{
  "name": "AltDiffusion",
  "full_name": "AltDiffusion",
  "description": "In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flickr30k-CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at https://github.com/FlagAI-Open/FlagAI.",
  "title": "AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities",
  "collection": "Image Generation Models",
  "area": "Computer Vision"
}
{
  "name": "Batch Nuclear-norm Maximization",
  "full_name": "Batch Nuclear-norm Maximization",
  "description": "**Batch Nuclear-norm Maximization** is an approach for aiding classification in label insufficient situations. It involves maximizing the nuclear-norm of the batch output matrix. The nuclear-norm of a matrix is an upper bound of the Frobenius-norm of the matrix. Maximizing nuclear-norm ensures large Frobenius-norm of the batch matrix, which leads to increased discriminability. The nuclear-norm of the batch matrix is also a convex approximation of the matrix rank, which refers to the prediction diversity.",
  "title": "Towards Discriminability and Diversity: Batch Nuclear-norm Maximization under Label Insufficient Situations",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "OCD",
  "full_name": "Overfitting Conditional Diffusion Model",
  "description": "",
  "title": "OCD: Learning to Overfit with Conditional Diffusion Models",
  "collection": "Meta-Learning Algorithms",
  "area": "General"
}
{
  "name": "FRILL",
  "full_name": "FRILL",
  "description": "**FRILL** is a non-semantic speech embedding model trained via knowledge distillation that is fast enough to be run in real-time on a mobile device. The fastest model runs at 0.9 ms, which is 300x faster than TRILL and 25x faster than TRILL-distilled.",
  "title": "FRILL: A Non-Semantic Speech Embedding for Mobile Devices",
  "collection": "Speech Embeddings",
  "area": "Audio"
}
{
  "name": "2D DWT",
  "full_name": "2D Discrete Wavelet Transform",
  "description": "",
  "title": "WaveMix: Multi-Resolution Token Mixing for Images",
  "collection": "Backbone Architectures",
  "area": "Computer Vision"
}
{
  "name": "DARTS Max-W",
  "full_name": "Differentiable Architecture Search Max-W",
  "description": "Like [DARTS](https://paperswithcode.com/method/darts), except subtract the max weight gradients.\r\n\r\nMax-W Weighting:\r\n\\begin{equation}\r\noutput_i = (1 - max(w) + w_i) * op_i(input_i)\r\n\\label{eqn:max_w}\r\n\\end{equation}",
  "title": "sharpDARTS: Faster and More Accurate Differentiable Architecture Search",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "Cross-encoder Reranking",
  "full_name": "Cross-encoder Reranking",
  "description": "Cross-encoder Reranking",
  "title": "ReRankMatch: Semi-Supervised Learning with Semantics-Oriented Similarity Representation",
  "collection": "Language Models",
  "area": "Natural Language Processing"
}
{
  "name": "DistDGL",
  "full_name": "DistDGL",
  "description": "**DistDGL** is a system for training GNNs in a mini-batch fashion on a cluster of machines. It is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality and light-weight mincut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces the communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability.",
  "title": "DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs",
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "CP-N3",
  "full_name": "Canonical Tensor Decomposition with N3 Regularizer",
  "description": "Canonical Tensor Decomposition, trained with N3 regularizer",
  "title": "Canonical Tensor Decomposition for Knowledge Base Completion",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "LTLS",
  "full_name": "Log-time and Log-space Extreme Classification",
  "description": "**LTLS** is a technique for multiclass and multilabel prediction that can perform training and inference in logarithmic time and space. LTLS embeds large classification problems into simple structured prediction problems and relies on efficient dynamic programming algorithms for inference. It tackles extreme multi-class and multi-label classification problems where the size $C$ of the output space is extremely large.",
  "title": "Log-time and Log-space Extreme Classification",
  "collection": "Structured Prediction",
  "area": "General"
}
{
  "name": "Highway networks",
  "full_name": "Highway networks",
  "description": "There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on \"information highways\". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.",
  "title": "Highway Networks",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "Reduction-A",
  "full_name": "Reduction-A",
  "description": "**Reduction-A** is an image model block used in the [Inception-v4](https://paperswithcode.com/method/inception-v4) architecture.",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "SRN",
  "full_name": "Stable Rank Normalization",
  "description": "**Stable Rank Normalization (SRN)** is a weight-normalization scheme which minimizes the stable rank of a linear operator. It simultaneously controls the Lipschitz constant and the stable rank of a linear operator. Stable rank is a softer version of the rank operator and is defined as the squared ratio of the Frobenius norm to the spectral norm.",
  "title": "Stable Rank Normalization for Improved Generalization in Neural Networks and GANs",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "COLA",
  "full_name": "COLA",
  "description": "**COLA** is a self-supervised pre-training approach for learning a general-purpose representation of audio. It is based on contrastive learning: it learns a representation which assigns high similarity to audio segments extracted from the same recording while assigning lower similarity to segments from different recordings.",
  "title": "Contrastive Learning of General-Purpose Audio Representations",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "3D CNN",
  "full_name": "3 Dimensional Convolutional Neural Network",
  "description": "",
  "title": "Uniformizing Techniques to Process CT scans with 3D CNNs for Tuberculosis Prediction",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DeepLabv3",
  "full_name": "DeepLabv3",
  "description": "**DeepLabv3** is a semantic segmentation architecture that improves upon [DeepLabv2](https://paperswithcode.com/method/deeplabv2) with several modifications. To handle the problem of segmenting objects at multiple scales, modules are designed which employ atrous [convolution](https://paperswithcode.com/method/convolution) in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, the Atrous [Spatial Pyramid Pooling](https://paperswithcode.com/method/spatial-pyramid-pooling) module from DeepLabv2 is augmented with image-level features encoding global context, which further boosts performance.\r\n\r\nThe changes to the ASPP module are that the authors apply [global average pooling](https://paperswithcode.com/method/global-average-pooling) on the last feature map of the model, feed the resulting image-level features to a 1 × 1 convolution with 256 filters (and [batch normalization](https://paperswithcode.com/method/batch-normalization)), and then bilinearly upsample the feature to the desired spatial dimension. In the end, the improved [ASPP](https://paperswithcode.com/method/aspp) consists of (a) one 1 × 1 convolution and three 3 × 3 convolutions with rates = (6, 12, 18) when output stride = 16 (all with 256 filters and batch normalization), and (b) the image-level features.\r\n\r\nAnother interesting difference is that DenseCRF post-processing from DeepLabv2 is no longer needed.",
  "title": "Rethinking Atrous Convolution for Semantic Image Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "MobileNetV3",
  "full_name": "MobileNetV3",
  "description": "**MobileNetV3** is a convolutional neural network that is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the [NetAdapt](https://paperswithcode.com/method/netadapt) algorithm, and then subsequently improved through novel architecture advances. Advances include (1) complementary search techniques, (2) new efficient versions of nonlinearities practical for the mobile setting, (3) new efficient network design.\r\n\r\nThe network design includes the use of a [hard swish](https://paperswithcode.com/method/hard-swish) activation and squeeze-and-excitation modules in the MBConv blocks.",
  "title": "Searching for MobileNetV3",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Content-Conditioned Style Encoder",
  "full_name": "Content-Conditioned Style Encoder",
  "description": "The **Content-Conditioned Style Encoder**, or **COCO**, is a style encoder used for image-to-image translation in the [COCO-FUNIT](https://paperswithcode.com/method/coco-funit#) architecture.  Unlike the style encoder in [FUNIT](https://arxiv.org/abs/1905.01723), COCO takes both content and style image as input. With this content conditioning scheme, we create a direct feedback path during learning to let the content image influence how the style code is computed. It also helps reduce the direct influence of the style image on the extracted style code.\r\n\r\nThe bottom part of the Figure details the architecture. First, the content image is fed into an encoder $E\\_{S, C}$ to compute a spatial feature map. This content feature map is then mean-pooled and mapped to a vector $\\zeta\\_{c}$. Similarly, the style image is fed into encoder $E\\_{S, S}$ to compute a spatial feature map. The style feature map is then mean-pooled and concatenated with an input-independent bias vector: the constant style bias (CSB). Note that while the regular bias in deep networks is added to the activations, in CSB, the bias is concatenated with the activations. The CSB provides a fixed input to the style encoder, which helps compute a style code that is less sensitive to the variations in the style image.\r\n\r\nThe concatenation of the style vector and the CSB is mapped to a vector $\\zeta\\_{s}$ via a fully connected layer. We then perform an element-wise product operation on $\\zeta\\_{c}$ and $\\zeta\\_{s}$, which gives the final style code. The style code is then mapped to produce the [AdaIN](https://paperswithcode.com/method/adaptive-instance-normalization) parameters for generating the translation. Through this element-wise product operation, the resulting style code is heavily influenced by the content image. 
One way to look at this mechanism is that it produces a customized style code for the input content image.\r\n\r\nThe COCO is used as a drop-in replacement for the style encoder in FUNIT. Let $\\phi$ denote the COCO mapping. The translation output is then computed via\r\n\r\n$$\r\nz\\_{c}=E\\_{c}\\left(x_{c}\\right), z_{s}=\\phi\\left(E\\_{s, s}\\left(x_{s}\\right), E\\_{s, c}\\left(x\\_{c}\\right)\\right), \\overline{\\mathbf{x}}=F\\left(z\\_{c}, z\\_{s}\\right)\r\n$$\r\n\r\nThe style code extracted by the COCO is more robust to variations in the style image. Note that we set $E\\_{S, C} \\equiv E\\_{C}$ to keep the number of parameters in our model similar to that in FUNIT.",
  "title": "COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Hierarchical Network Dissection",
  "full_name": "Hierarchical Network Dissection",
  "description": "**Hierarchical Network Dissection** is a pipeline for interpreting the internal representation of face-centric inference models. Using a probabilistic formulation, Hierarchical Network Dissection pairs units of the model with concepts in a \"Face Dictionary\" (a collection of facial concepts with corresponding sample images). Interpretable units are discovered in a [convolution](https://paperswithcode.com/method/convolution) layer through HND to identify multiple instances of unit-concept affinity. The pipeline is inspired by [Network Dissection](https://paperswithcode.com/method/network-dissection), an interpretability model for object-centric and scene-centric models.",
  "title": "Interpreting Face Inference Models using Hierarchical Network Dissection",
  "collection": "Interpretability",
  "area": "General"
}
{
  "name": "Siamese U-Net",
  "full_name": "Siamese U-Net",
  "description": "Siamese U-Net model with a pre-trained ResNet34 architecture as an encoder for data efficient Change Detection",
  "title": "Deep Active Learning in Remote Sensing for data efficient Change Detection",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Self-Learning",
  "full_name": null,
  "description": "",
  "title": null,
  "collection": "Lifelong Learning",
  "area": "General"
}
{
  "name": "Trans-Encoder",
  "full_name": "Trans-Encoder",
  "description": "Unsupervised knowledge distillation from a pretrained language model to *itself*, by alternating between its bi- and cross-encoder forms.",
  "title": "Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations",
  "collection": "Sentence Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "LRNet",
  "full_name": "Local Relation Network",
  "description": "The **Local Relation Network** (**LR-Net**) is a network built with local relation layers, which act as an image feature extractor. This feature extractor adaptively determines aggregation weights based on the compositional relationship of local pixel pairs.",
  "title": "Local Relation Networks for Image Recognition",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "GreedyNAS-C",
  "full_name": "GreedyNAS-C",
  "description": "**GreedyNAS-C** is a convolutional neural network discovered using the [GreedyNAS](https://paperswithcode.com/method/greedynas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building blocks used are inverted residual blocks (from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2)) and squeeze-and-excitation blocks.",
  "title": "GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Florence",
  "full_name": "Florence",
  "description": "Florence is a computer vision foundation model aiming to learn universal visual-language representations that can be adapted to various computer vision tasks, such as visual question answering, image captioning, and video retrieval, among other tasks. Florence's workflow consists of data curation, unified learning, Transformer architectures and adaptation. Florence is pre-trained in an image-label-description space, utilizing unified image-text contrastive learning. It involves a two-tower architecture: a 12-layer Transformer for the language encoder, and a Vision Transformer for the image encoder. Two linear projection layers are added on top of the image encoder and language encoder to match the dimensions of image and language features. Compared to previous methods for cross-modal shared representations, Florence expands beyond simple classification and retrieval capabilities to advanced representations that support object level, multiple modality, and videos respectively.",
  "title": "Florence: A New Foundation Model for Computer Vision",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "RBF",
  "full_name": "Radial Basis Function",
  "description": "",
  "title": null,
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "PointRend",
  "full_name": "PointRend",
  "description": "**PointRend** is a module for image segmentation tasks, such as instance and semantic segmentation, that attempts to treat segmentation as image rending problem to efficiently \"render\" high-quality label maps. It uses a subdivision strategy to adaptively select a non-uniform set of points at which to compute labels. PointRend can be incorporated into popular meta-architectures for both instance segmentation (e.g. [Mask R-CNN](https://paperswithcode.com/method/mask-r-cnn)) and semantic segmentation (e.g. [FCN](https://paperswithcode.com/method/fcn)). Its subdivision strategy efficiently computes high-resolution segmentation maps using an order of magnitude fewer floating-point operations than direct, dense computation.\r\n\r\nPointRend is a general module that admits many possible implementations. Viewed abstractly, a PointRend module accepts one or more typical CNN feature maps $f\\left(x\\_{i}, y\\_{i}\\right)$ that are defined over regular grids, and outputs high-resolution predictions $p\\left(x^{'}\\_{i}, y^{'}\\_{i}\\right)$ over a finer grid. Instead of making excessive predictions over all points on the output grid, PointRend makes predictions only on carefully selected points. To make these predictions, it extracts a point-wise feature representation for the selected points by interpolating $f$, and uses a small point head subnetwork to predict output labels from the point-wise features.",
  "title": "PointRend: Image Segmentation as Rendering",
  "collection": "Semantic Segmentation Modules",
  "area": "Computer Vision"
}
{
  "name": "Inception-ResNet-v2-A",
  "full_name": "Inception-ResNet-v2-A",
  "description": "**Inception-ResNet-v2-A** is an image model block for a 35 x 35 grid used in the [Inception-ResNet-v2](https://paperswithcode.com/method/inception-resnet-v2) architecture.",
  "title": "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "Accumulating Eligibility Trace",
  "full_name": "Accumulating Eligibility Trace",
  "description": "An **Accumulating Eligibility Trace** is a type of [eligibility trace](https://paperswithcode.com/method/eligibility-trace) where the trace increments in an accumulative way. For the memory vector $\\textbf{e}\\_{t} \\in \\mathbb{R}^{b} \\geq \\textbf{0}$:\r\n\r\n$$\\mathbf{e\\_{0}} = \\textbf{0}$$\r\n\r\n$$\\textbf{e}\\_{t} = \\nabla{\\hat{v}}\\left(S\\_{t}, \\mathbf{\\theta}\\_{t}\\right) + \\gamma\\lambda\\textbf{e}\\_{t}$$",
  "title": null,
  "collection": "Eligibility Traces",
  "area": "Reinforcement Learning"
}
{
  "name": "DMAGE",
  "full_name": "Unsupervised Deep Manifold Attributed Graph Embedding",
  "description": "Unsupervised attributed graph representation learning is challenging since both structural and feature information are required to be represented in the latent space. Existing methods concentrate on learning latent representation via reconstruction tasks, but cannot directly optimize representation and are prone to oversmoothing, thus limiting the applications on downstream tasks. To alleviate these issues, we propose a novel graph embedding framework named Deep Manifold Attributed Graph Embedding (DMAGE). A node-to-node geodesic similarity is proposed to compute the inter-node similarity between the data space and the latent space and then use Bergman divergence as loss function to minimize the difference between them. We then design a new network structure with fewer aggregation to alleviate the oversmoothing problem and incorporate graph structure augmentation to improve the representation's stability. Our proposed DMAGE surpasses state-of-the-art methods by a significant margin on three downstream tasks: unsupervised visualization, node clustering, and link prediction across four popular datasets.",
  "title": null,
  "collection": "Clustering",
  "area": "General"
}
{
  "name": "Dynamic SmoothL1 Loss",
  "full_name": "Dynamic SmoothL1 Loss",
  "description": "**Dynamic SmoothL1 Loss (DSL)** is a loss function in object detection where we change the shape of loss function to gradually focus on high quality samples:\r\n\r\n$$\\text{DSL}\\left(x, \\beta\\_{now}\\right) = 0.5|{x}|^{2}/\\beta\\_{now}, \\text{ if } |x| < \\beta\\_{now}\\text{,} $$ \r\n$$\\text{DSL}\\left(x, \\beta\\_{now}\\right) = |{x}| - 0.5\\beta\\_{now}\\text{, otherwise} $$ \r\n\r\nDSL will change the value of $\\beta\\_{now}$ according to the statistics of regression errors which can reflect the localization accuracy. It was introduced as part of the [Dynamic R-CNN](https://paperswithcode.com/method/dynamic-r-cnn) model.",
  "title": "Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "CoLU",
  "full_name": "Collapsing Linear Unit",
  "description": "CoLU is an activation function similar to Swish and Mish in properties. It is defined as:\r\n$$f(x)=\\frac{x}{1-x^{-(x+e^x)}}$$\r\nIt is smooth, continuously differentiable, unbounded above, bounded below, non-saturating, and non-monotonic. Based on experiments done with CoLU with different activation functions, it is observed that CoLU usually performs better than other functions on deeper neural networks.",
  "title": "Deeper Learning with CoLU Activation",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "E-MBConv",
  "full_name": "E-MBConv",
  "description": "",
  "title": "EfficientPose: Scalable single-person pose estimation",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "PSFR-GAN",
  "full_name": "PSFR-GAN",
  "description": "**PSFR-GAN** is a semantic-aware style transformation framework for face restoration. Given a pair of LQ face image and its corresponding parsing map, we first generate a multi-scale pyramid of the inputs, and then progressively modulate different scale features from coarse-to-fine in a semantic-aware style transfer way. Compared with previous networks, the proposed PSFR-GAN makes full use of the semantic (parsing maps) and pixel (LQ images) space information from different scales of inputs.",
  "title": "Progressive Semantic-Aware Style Transformation for Blind Face Restoration",
  "collection": "Generative Adversarial Networks",
  "area": "Computer Vision"
}
{
  "name": "ShapeConv",
  "full_name": "ShapeConv",
  "description": "**ShapeConv**, or **Shape-aware Convolutional layer**, is a convolutional layer for processing the depth feature in indoor RGB-D semantic segmentation. The depth feature is firstly decomposed into a shape-component and a base-component, next two learnable weights are introduced to cooperate with them independently, and finally a [convolution](https://paperswithcode.com/method/convolution) is applied on the re-weighted combination of these two components.",
  "title": "ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "b2b transfer learning",
  "full_name": "building to building transfer learning",
  "description": "using transfer learning to transfer knowledge from one building to predict the energy consumption of another building with scarce data",
  "title": "Transfer Learning in Deep Learning Models for Building Load Forecasting: Case of Limited Data",
  "collection": "Imitation Learning Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "LFPNet (TTA)",
  "full_name": "LFPNet with test time augmentation",
  "description": "",
  "title": "$F$, $B$, Alpha Matting",
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "HardELiSH",
  "full_name": "HardELiSH",
  "description": "**HardELiSH** is an activation function for neural networks.  The HardELiSH is a multiplication of the [HardSigmoid](https://paperswithcode.com/method/hard-sigmoid) and [ELU](https://paperswithcode.com/method/elu) in the negative part and a multiplication of the Linear and the HardSigmoid in the positive\r\npart:\r\n\r\n$$f\\left(x\\right) = x\\max\\left(0, \\min\\left(1, \\left(\\frac{x+1}{2}\\right)\\right) \\right) \\text{ if } x \\geq 1$$\r\n$$f\\left(x\\right) = \\left(e^{x}-1\\right)\\max\\left(0, \\min\\left(1, \\left(\\frac{x+1}{2}\\right)\\right)\\right) \\text{ if } x < 0 $$\r\n\r\nSource: [Activation Functions](https://arxiv.org/pdf/1811.03378.pdf)",
  "title": "The Quest for the Golden Activation Function",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "SegFormer",
  "full_name": "SegFormer",
  "description": "**SegFormer** is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based framework for semantic segmentation that unifies Transformers with lightweight [multilayer perceptron](https://paperswithcode.com/method/feedforward-network) (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, and thus combining both local attention and global attention to render powerful representations.",
  "title": "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "FAVOR+",
  "full_name": "Fast Attention Via Positive Orthogonal Random Features",
  "description": "**FAVOR+**, or **Fast Attention Via Positive Orthogonal Random Features**, is an efficient attention mechanism used in the [Performer](https://paperswithcode.com/method/performer) architecture which leverages approaches such as kernel methods and random features approximation for approximating [softmax](https://paperswithcode.com/method/softmax) and Gaussian kernels. \r\n\r\nFAVOR+ works for attention blocks using matrices $\\mathbf{A} \\in \\mathbb{R}^{L×L}$ of the form $\\mathbf{A}(i, j) = K(\\mathbf{q}\\_{i}^{T}, \\mathbf{k}\\_{j}^{T})$, with $\\mathbf{q}\\_{i}/\\mathbf{k}\\_{j}$ standing for the $i^{th}/j^{th}$ query/key row-vector in $\\mathbf{Q}/\\mathbf{K}$ and kernel $K : \\mathbb{R}^{d } × \\mathbb{R}^{d} \\rightarrow \\mathbb{R}\\_{+}$ defined for the (usually randomized) mapping: $\\phi : \\mathbb{R}^{d } → \\mathbb{R}^{r}\\_{+}$ (for some $r > 0$) as:\r\n\r\n$$K(\\mathbf{x}, \\mathbf{y}) = E[\\phi(\\mathbf{x})^{T}\\phi(\\mathbf{y})] $$\r\n\r\nWe call $\\phi(\\mathbf{u})$ a random feature map for $\\mathbf{u} \\in \\mathbb{R}^{d}$ . For $\\mathbf{Q}^{'}, \\mathbf{K}^{'} \\in \\mathbb{R}^{L \\times r}$ with rows given as $\\phi(\\mathbf{q}\\_{i}^{T})^{T}$ and $\\phi(\\mathbf{k}\\_{i}^{T})^{T}$  respectively, this leads directly to the efficient attention mechanism of the form:\r\n\r\n$$ \\hat{Att\\_{\\leftrightarrow}}\\left(\\mathbf{Q}, \\mathbf{K}, \\mathbf{V}\\right) = \\hat{\\mathbf{D}}^{-1}(\\mathbf{Q^{'}}((\\mathbf{K^{'}})^{T}\\mathbf{V}))$$\r\n\r\nwhere\r\n\r\n$$\\mathbf{\\hat{D}} = \\text{diag}(\\mathbf{Q^{'}}((\\mathbf{K^{'}})\\mathbf{1}\\_{L})) $$\r\n\r\nThe above scheme constitutes the [FA](https://paperswithcode.com/method/dfa)-part of the FAVOR+ mechanism. 
The other parts are achieved by:\r\n\r\n- The R part :  The softmax kernel is approximated though trigonometric functions, in the form of a regularized softmax-kernel SMREG, that employs positive random features (PRFs).\r\n- The OR+ part : To reduce the variance of the estimator, so we can use a smaller number of random features, different samples are entangled to be exactly orthogonal using the Gram-Schmidt orthogonalization procedure.\r\n\r\nThe details are quite technical, so it is recommended you read the paper for further information on these steps.",
  "title": "Rethinking Attention with Performers",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "U2-Net",
  "full_name": "U2-Net",
  "description": "**U2-Net** is a two-level nested U-structure architecture that is designed for salient object detection (SOD).  The architecture allows the network to go deeper, attain high resolution, without significantly increasing the memory and computation cost. This is achieved by a nested U-structure: on the bottom level, with a novel ReSidual U-block (RSU) module, which is able to extract intra-stage multi-scale features without degrading the feature map resolution; on the top level, there is a [U-Net](https://paperswithcode.com/method/u-net) like structure, in which each stage is filled by a RSU block.",
  "title": "U$^2$-Net: Going Deeper with Nested U-Structure for Salient Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Conditional DBlock",
  "full_name": "Conditional DBlock",
  "description": "**Conditional DBlock** is a residual based block used in the discriminator of the [GAN-TTS](https://paperswithcode.com/method/gan-tts) architecture. They are similar to the [GBlocks](https://paperswithcode.com/method/gblock) used in the generator, but without [batch normalization](https://paperswithcode.com/method/batch-normalization). Unlike the [DBlock](https://paperswithcode.com/method/dblock), the Conditional DBlock adds the embedding of the linguistic features after the first [convolution](https://paperswithcode.com/method/convolution).",
  "title": "High Fidelity Speech Synthesis with Adversarial Networks",
  "collection": "Audio Model Blocks",
  "area": "Audio"
}
{
  "name": "U-CAM",
  "full_name": "Uncertainty Class Activation Map (U-CAM) Using Gradient Certainty Method",
  "description": "Understanding and explaining deep learning models is an imperative task. Towards this, we propose a method that obtains gradient-based certainty estimates that also provide [visual attention](https://paperswithcode.com/method/visual-attention) maps. Particularly, we solve for visual question answering task. We incorporate modern probabilistic deep learning methods that we further improve by using the gradients for these estimates. These have two-fold benefits: a) improvement in obtaining the certainty estimates that correlate better with misclassified samples and b) improved attention maps that provide state-of-the-art results in terms of correlation with human attention regions. The improved attention maps result in consistent improvement for various methods for visual question answering. Therefore, the proposed technique can be thought of as a tool for obtaining improved certainty estimates and explanations for deep learning models.",
  "title": null,
  "collection": "VQA Models",
  "area": "Computer Vision"
}
{
  "name": "Dilated Bottleneck with Projection Block",
  "full_name": "Dilated Bottleneck with Projection Block",
  "description": "**Dilated Bottleneck with Projection Block** is an image model block used in the [DetNet](https://paperswithcode.com/method/detnet) convolutional neural network architecture. It employs a bottleneck structure with dilated convolutions to efficiently enlarge the receptive field. It uses a [1x1 convolution](https://paperswithcode.com/method/1x1-convolution) to ensure the spatial size stays fixed.",
  "title": "DetNet: A Backbone network for Object Detection",
  "collection": "Skip Connection Blocks",
  "area": "General"
}
{
  "name": "SqueezeNeXt",
  "full_name": "SqueezeNeXt",
  "description": "**SqueezeNeXt** is a type of convolutional neural network that uses the [SqueezeNet](https://paperswithcode.com/method/squeezenet) architecture as a baseline, but makes a number of changes. First, a more aggressive channel reduction is used by incorporating a two-stage squeeze module. This significantly reduces the total number of parameters used with the 3×3 convolutions. Secondly, it uses separable 3 × 3 convolutions to further reduce the model size, and removes the additional 1×1 branch after the squeeze module. Thirdly, the network use an element-wise addition skip connection similar to that of [ResNet](https://paperswithcode.com/method/resnet) architecture.",
  "title": "SqueezeNext: Hardware-Aware Neural Network Design",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DetNASNet",
  "full_name": "DetNASNet",
  "description": "**DetNASNet** is a convolutional neural network designed to be an object detection backbone and discovered through [DetNAS](https://paperswithcode.com/method/detnas) architecture search. It uses [ShuffleNet V2](https://paperswithcode.com/method/shufflenet-v2) blocks as its basic building block.",
  "title": "DetNAS: Backbone Search for Object Detection",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "ALS",
  "full_name": "Adaptive Label Smoothing",
  "description": "",
  "title": "Efficient Model for Image Classification With Regularization Tricks",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Manifold Mixup",
  "full_name": "Manifold Mixup",
  "description": "**Manifold Mixup** is a regularization method that encourages neural networks to predict less confidently on interpolations of hidden representations. It leverages semantic interpolations as an additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn class-representations with fewer directions of variance.\r\n\r\nConsider training a deep neural network $f\\left(x\\right) = f\\_{k}\\left(g\\_{k}\\left(x\\right)\\right)$, where $g\\_{k}$ denotes the part of the neural network mapping the input data to the hidden representation at layer $k$, and $f\\_{k}$ denotes the\r\npart mapping such hidden representation to the output $f\\left(x\\right)$. Training $f$ using Manifold Mixup is performed in five steps:\r\n\r\n(1) Select a random layer $k$ from a set of eligible layers $S$ in the neural network. This set may include the input layer $g\\_{0}\\left(x\\right)$.\r\n\r\n(2) Process two random data minibatches $\\left(x, y\\right)$ and $\\left(x', y'\\right)$ as usual, until reaching layer $k$. This provides us with two intermediate minibatches $\\left(g\\_{k}\\left(x\\right), y\\right)$ and $\\left(g\\_{k}\\left(x'\\right), y'\\right)$.\r\n\r\n(3) Perform Input [Mixup](https://paperswithcode.com/method/mixup) on these intermediate minibatches. This produces the mixed minibatch:\r\n\r\n$$\r\n\\left(\\tilde{g}\\_{k}, \\tilde{y}\\right) = \\left(\\text{Mix}\\_{\\lambda}\\left(g\\_{k}\\left(x\\right), g\\_{k}\\left(x'\\right)\\right), \\text{Mix}\\_{\\lambda}\\left(y, y'\\right\r\n)\\right),\r\n$$\r\n\r\nwhere $\\text{Mix}\\_{\\lambda}\\left(a, b\\right) = \\lambda \\cdot a + \\left(1 − \\lambda\\right) \\cdot b$. Here, $\\left(y, y'\r\n\\right)$ are one-hot labels, and the mixing coefficient\r\n$\\lambda \\sim \\text{Beta}\\left(\\alpha, \\alpha\\right)$ as in mixup. 
For instance, $\\alpha = 1.0$ is equivalent to sampling $\\lambda \\sim U\\left(0, 1\\right)$.\r\n\r\n(4) Continue the forward pass in the network from layer $k$ until the output using the mixed minibatch $\\left(\\tilde{g}\\_{k}, \\tilde{y}\\right)$.\r\n\r\n(5) This output is used to compute the loss value and\r\ngradients that update all the parameters of the neural network.",
  "title": "Manifold Mixup: Better Representations by Interpolating Hidden States",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "LXMERT",
  "full_name": "Learning Cross-Modality Encoder Representations from Transformers",
  "description": "LXMERT is a model for learning vision-and-language cross-modality representations. It consists of a Transformer model that consists three encoders: object relationship encoder, a language encoder, and a cross-modality encoder. The model takes two inputs: image with its related sentence. The images are represented as a sequence of objects, whereas each sentence is represented as sequence of words. By combining the self-attention and cross-attention layers the model is able to generated language representation, image representations, and cross-modality representations from the input. The model is pre-trained with image-sentence pairs via five pre-training tasks: masked language modeling, masked object prediction, cross-modality matching, and image questions answering. These tasks help the model to learn both intra-modality and cross-modality relationships.",
  "title": "LXMERT: Learning Cross-Modality Encoder Representations from Transformers",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "WIPA",
  "full_name": "Wavelet-integrated Identity Preserving Adversarial Network for face super-resolution",
  "description": "# WIPA: Wavelet-integrated, Identity Preserving, Adversarial network for Face Super-resolution\r\nPytorch implementation of WIPA: Super-resolution of very low-resolution face images with a **W**avelet Integrated, **I**dentity **P**reserving, **A**dversarial Network. \r\n# Paper:\r\n[Super-resolution of very low-resolution face images with a Wavelet Integrated, Identity Preserving, Adversarial Network](https://www.sciencedirect.com/science/article/abs/pii/S0923596522000753?dgcid=coauthor).\r\nYou can download the pre-proof version of the article [here](https://drive.google.com/file/d/1GHWiCcScPF1PK4xozoRf-88Rytom-kvl/view?usp=sharing) but  please refer to the origital manuscript for citation.\r\n## Citation\r\nIf you find this work useful for your research, please consider citing our paper:\r\n```\r\n@article{DASTMALCHI2022116755,\r\ntitle = {Super-resolution of very low-resolution face images with a wavelet integrated, identity preserving, adversarial network},\r\njournal = {Signal Processing: Image Communication},\r\nvolume = {107},\r\npages = {116755},\r\nyear = {2022},\r\nissn = {0923-5965},\r\ndoi = {https://doi.org/10.1016/j.image.2022.116755},\r\nurl = {https://www.sciencedirect.com/science/article/pii/S0923596522000753},\r\nauthor = {Hamidreza Dastmalchi and Hassan Aghaeinia},\r\nkeywords = {Super-resolution, Wavelet prediction, Generative Adversarial Networks, Face Hallucination, Identity preserving, Perceptual quality},\r\n```\r\n## Linkdin Profile:\r\n**Hamidreza Dastmalchi linkdin profile:**\r\nhttps://www.linkedin.com/in/hamidreza-dastmalchi-80bb4574/\r\n## WIPA Algorithm\r\nwe present **Wavelet\r\nPrediction blocks** attached to a **Baseline CNN network** to predict wavelet missing details of facial images. The\r\nextracted wavelet coefficients are concatenated with original feature maps in different scales to recover fine\r\ndetails. 
Unlike other wavelet-based FH methods, this algorithm exploits the wavelet-enriched feature maps as\r\ncomplementary information to facilitate the hallucination task. We introduce a **wavelet prediction loss** to push\r\nthe network to generate wavelet coefficients. In addition to the wavelet-domain cost function, a combination of\r\n**perceptual**, **adversarial**, and **identity loss** functions has been utilized to achieve low-distortion and perceptually\r\nhigh-quality images while maintaining identity. The training scheme of the Wavelet-Integrated network with the combination of five loss terms is shown below:\r\n<p align=\"center\">\r\n  <img width=\"500\" src=\"./block-diagram/WIPA-Training-Scheme.jpg\">\r\n</p>\r\n\r\n## Datasets\r\nThe [CelebA dataset](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) is used for training the proposed FH algorithm. The database contains more than 200 K different face images under significant pose, illumination, and expression variations. In our experiment, two distinct groups of 20 thousand images are randomly selected from the CelebA dataset as our train and test datasets. In order to test the generalizing capacity of the method, we have further evaluated the performance of the proposed approach on the [LFW](http://vis-www.cs.umass.edu/lfw/) and [Helen](http://www.ifp.illinois.edu/~vuongle2/helen/) datasets. All the testing and training images are roughly aligned using similarity transformation with landmarks detected by the well-known MTCNN network. The images are rescaled to the size of 128 × 128. The corresponding LR images are constructed by down-sampling the HR images using bicubic interpolation. 
The experiments are conducted at two **scaling** factors, 8X and 16X, with LR images of size 16 × 16 and 8 × 8, respectively.\r\n **Before starting to train or test the network**, you must put the training and test images in the corresponding folders:\r\n- Put training images in “.\\data\\train” directory.\r\n- Put celeba test images in “.\\data\\test\\celeba”, lfw test images in “.\\data\\test\\lfw” and helen test images in “.\\data\\test\\helen”.\r\n\r\n## Pretrained Weights\r\nThe pretrained weights can be downloaded [here](https://drive.google.com/drive/folders/18V1kPDHW6F05L0xOOODNHZHO566SA6iC?usp=sharing).\r\n\r\n## Code\r\nThe code consists of two main files: the **main.py** file for training the network and the **test.py** file for evaluating the algorithm with different metrics like PSNR, SSIM and verification rate.\r\n### Training \r\nTo train the network, simply run this code in Anaconda terminal:\r\n```\r\n>>python main.py\r\n```\r\nWe designed different input arguments for controlling the training procedure. Please use the --help option to see the available input arguments. \r\n\r\n#### Example: \r\nFor example, to train the wavelet-integrated network through GPU with scale factor of 8, without having pre-trained model coefficients, with learning rate of 5e-5, you can simply run the following code in the terminal:\r\n```\r\npython main.py --scale 8 --wi_net \"\" --disc_net \"\" --wavelet_integrated True --lr 0.00005\r\n```\r\n\r\n### Testing\r\nFor evaluation (testing), simply run the following code in terminal:\r\n```\r\n>>python test.py\r\n```\r\nWe have also developed different options as input arguments to control the testing procedure. You can evaluate psnr, ssim, fid score and also verification rate by the “test.py” file. 
To do this, you first have to put the test images in the corresponding folders in the data root.\r\n\r\n#### Example: \r\nFor example, to evaluate the psnr and ssim of a wavelet-integrated pretrained model in scale of 8 and save the super-resolved results in folder of “./results/celeba”, you can write the following code in the command window:\r\n```\r\n>>python test.py --wavelet_integrated True --scale 8 --wi_net gen_net_8x --save_flag True --save_folder ./results/celeba --metrics psnr ssim\r\n```\r\nTo estimate the fid score, you have to produce the super-resolved test images first. Therefore, if you have not generated the super-resolved images, you have to pass psnr and ssim to --metrics together with fid. You can also add the acc option to the metrics to evaluate the verification rate of the model:\r\n```\r\n>>python test.py --wavelet_integrated True --scale 8 --wi_net gen_net_8x --save_flag True --save_folder ./results/celeba --metrics psnr ssim fid acc\r\n```\r\n### Demo \r\nIn addition, we have developed a “demo.py” python file to demonstrate the results of some sample images in the “./sample_images/gt” directory. To run the demo file, simply write the following code in the terminal:\r\n```\r\n>>python demo.py\r\n```\r\nBy default, the images of “./sample_images/gt” folder will be super-resolved by the wavelet-integrated network by a scale factor of 8 and the results will be saved in the “./sample_images/sr” folder. To change the scaling factor, one must alter not only the --scale option but also the corresponding --wi_net argument to import the relevant pretrained state dictionary.",
  "title": null,
  "collection": "Face Restoration Models",
  "area": "Computer Vision"
}
{
  "name": "Contextualized Topic Models",
  "full_name": "Contextualized Topic Models",
  "description": "Contextualized Topic Models are based on the Neural-ProdLDA variational autoencoding approach by Srivastava and Sutton (2017). \r\n\r\nThis approach trains an encoding neural network to map pre-trained contextualized word embeddings (e.g., [BERT](https://paperswithcode.com/method/bert)) to latent representations. Those latent representations are sampled variationally from a Gaussian distribution $N(\\mu, \\sigma^2)$ and passed to a decoder network that has to reconstruct the document bag-of-word representation.",
  "title": "Cross-lingual Contextualized Topic Models with Zero-shot Learning",
  "collection": "Topic Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "Neural Turing Machine",
  "full_name": "Neural Turing Machine",
  "description": "A **Neural Turing Machine** is a working memory neural network model. It couples a neural network architecture with external memory resources. The whole architecture is differentiable end-to-end with gradient descent. The models can infer tasks such as copying, sorting and associative recall.\r\n\r\nA Neural Turing Machine (NTM) architecture contains two basic components: a neural\r\nnetwork controller and a memory bank. The Figure presents a high-level diagram of the NTM\r\narchitecture. Like most neural networks, the controller interacts with the external world via\r\ninput and output vectors. Unlike a standard network, it also interacts with a memory matrix\r\nusing selective read and write operations. By analogy to the Turing machine we refer to the\r\nnetwork outputs that parameterise these operations as “heads.”\r\n\r\nEvery component of the architecture is differentiable. This is achieved by defining 'blurry' read and write operations that interact to a greater or lesser degree with all the elements in memory (rather\r\nthan addressing a single element, as in a normal Turing machine or digital computer). The\r\ndegree of blurriness is determined by an attentional “focus” mechanism that constrains each\r\nread and write operation to interact with a small portion of the memory, while ignoring the\r\nrest. Because interaction with the memory is highly sparse, the NTM is biased towards\r\nstoring data without interference. The memory location brought into attentional focus is\r\ndetermined by specialised outputs emitted by the heads. These outputs define a normalised\r\nweighting over the rows in the memory matrix (referred to as memory “locations”). Each\r\nweighting, one per read or write head, defines the degree to which the head reads or writes\r\nat each location. A head can thereby attend sharply to the memory at a single location or\r\nweakly to the memory at many locations",
  "title": "Neural Turing Machines",
  "collection": "Working Memory Models",
  "area": "General"
}
{
  "name": "Submanifold Convolution",
  "full_name": "Submanifold Convolution",
  "description": "**Submanifold Convolution (SC)** is a spatially sparse [convolution](https://paperswithcode.com/method/convolution) operation used for tasks with sparse data like semantic segmentation of 3D point clouds. An SC convolution computes the set of active sites in the same way as a regular convolution: it looks for the presence of any active sites in its receptive field of size $f^{d}$. If the input has size $l$ then the output will have size $\\left(l − f + s\\right)/s$. Unlike a regular convolution, an SC convolution discards the ground state for non-active sites by assuming that the input from those sites is zero. For more details see the [paper](https://paperswithcode.com/paper/3d-semantic-segmentation-with-submanifold), or the official code [here](https://github.com/facebookresearch/SparseConvNet).",
  "title": "3D Semantic Segmentation with Submanifold Sparse Convolutional Networks",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "FuseFormer",
  "full_name": "FuseFormer",
  "description": "**FuseFormer** is a [Transformer](https://paperswithcode.com/method/transformer)-based model designed for video inpainting via fine-grained feature fusion based on novel [Soft Split and Soft Composition](https://paperswithcode.com/method/soft-split-and-soft-composition) operations. The soft split divides the feature map into many patches with a given overlapping interval, while the soft composition stitches them back into a whole feature map in which pixels in overlapping regions are summed. FuseFormer builds soft composition and soft split into its [feedforward network](https://paperswithcode.com/method/feedforward-network) to further enhance sub-patch-level feature fusion.",
  "title": "FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting",
  "collection": "Generative Video Models",
  "area": "Computer Vision"
}
{
  "name": "High-resolution input",
  "full_name": "High-resolution input",
  "description": "",
  "title": "EfficientPose: Scalable single-person pose estimation",
  "collection": "Image Representations",
  "area": "Computer Vision"
}
{
  "name": "Positional Encoding Generator",
  "full_name": "Positional Encoding Generator",
  "description": "**Positional Encoding Generator**, or **PEG**, is a module used in the [Conditional Position Encoding](https://paperswithcode.com/method/conditional-positional-encoding) position embeddings. It dynamically produces the positional encodings conditioned on the local neighborhood of an input token. To condition on the local neighbors, we first reshape the flattened input sequence $X \\in \\mathbb{R}^{B \\times N \\times C}$ of DeiT back to $X^{\\prime} \\in \\mathbb{R}^{B \\times H \\times W \\times C}$ in the 2-D image space. Then, a function (denoted by $\\mathcal{F}$ in the Figure) is repeatedly applied to the local patch in $X^{\\prime}$ to produce the conditional positional encodings $E \\in \\mathbb{R}^{B \\times H \\times W \\times C}$. PEG can be efficiently implemented with a 2-D convolution with kernel size $k$ ($k \\geq 3$) and $\\frac{k-1}{2}$ zero paddings. Note that the zero paddings here are important to make the model aware of the absolute positions, and $\\mathcal{F}$ can take various forms, such as separable convolutions.",
  "title": "Conditional Positional Encodings for Vision Transformers",
  "collection": "Miscellaneous Components",
  "area": "General"
}
{
  "name": "BASNet",
  "full_name": "Boundary-Aware Segmentation Network",
  "description": "**BASNet**, or **Boundary-Aware Segmentation Network**, is an image segmentation architecture that comprises a predict-refine architecture and a hybrid loss for highly accurate image segmentation. The predict-refine architecture consists of a densely supervised encoder-decoder network and a residual refinement module, which are respectively used to predict and refine a segmentation probability map. The hybrid loss is a combination of the binary cross entropy, structural similarity and intersection-over-union losses, which guide the network to learn three-level (i.e., pixel-, patch- and map-level) hierarchy representations.",
  "title": "Boundary-Aware Segmentation Network for Mobile and Web Applications",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Probabilistic Anchor Assignment",
  "full_name": "Probabilistic Anchor Assignment",
  "description": "**Probabilistic anchor assignment (PAA)** adaptively separates a set of anchors into positive and negative samples for a GT box according to the learning status of the model associated with it. To do so we first define a score of a detected bounding box that reflects both the classification and localization qualities. We then identify the connection between this score and the training objectives and represent the score as the combination of two loss objectives. Based on this scoring scheme, we calculate the scores of individual anchors that reflect how the model finds useful cues to detect a target object in each anchor. With these anchor scores, we aim to find a probability distribution of two modalities that best represents the scores as positive or negative samples as in the Figure. \r\n\r\nUnder the found probability distribution, anchors whose probabilities under the positive component are high are selected as positive samples. This transforms the anchor assignment problem into a maximum likelihood estimation for a probability distribution whose parameters are determined by the anchor scores. Based on the assumption that anchor scores calculated by the model are samples drawn from a probability distribution, it is expected that the model can infer the sample separation in a probabilistic way, leading to easier training of the model compared to other non-probabilistic assignments. Moreover, since positive samples are adaptively selected based on the anchor score distribution, it requires neither a pre-defined number of positive samples nor an IoU threshold.",
  "title": "Probabilistic Anchor Assignment with IoU Prediction for Object Detection",
  "collection": "Anchor Generation Modules",
  "area": "Computer Vision"
}
{
  "name": "FlexFlow",
  "full_name": "FlexFlow",
  "description": "**FlexFlow** is a deep learning engine that uses guided randomized search of the SOAP (Sample, Operator, Attribute, and Parameter) space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy’s performance and is three orders of magnitude faster than prior approaches that execute each strategy. \r\n\r\nFlexFlow uses two main components: a fast, incremental execution simulator to evaluate different parallelization strategies, and a Markov Chain Monte Carlo (MCMC) search algorithm that takes advantage of the incremental simulator to rapidly explore the large search space.",
  "title": null,
  "collection": "Distributed Methods",
  "area": "General"
}
{
  "name": "KAF",
  "full_name": "Kernel Activation Function",
  "description": "A **Kernel Activation Function** is a non-parametric activation function defined as a one-dimensional kernel approximator:\r\n\r\n$$ f(s) = \\sum_{i=1}^D \\alpha_i \\kappa( s, d_i) $$\r\n\r\nwhere:\r\n\r\n1. The dictionary of the kernel elements $d_1, \\ldots, d_D$ is fixed by sampling the $x$-axis with a uniform step around 0.\r\n2. The user selects the kernel function (e.g., Gaussian, [ReLU](https://paperswithcode.com/method/relu), [Softplus](https://paperswithcode.com/method/softplus)) and the number of kernel elements $D$ as a hyper-parameter. A larger dictionary leads to more expressive activation functions and a larger number of trainable parameters.\r\n3. The linear coefficients are adapted independently at every neuron via standard back-propagation.\r\n\r\nIn addition, the linear coefficients can be initialized using kernel ridge regression to behave similarly to a known function in the beginning of the optimization process.",
  "title": "Kafnets: kernel-based non-parametric activation functions for neural networks",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "DSGN",
  "full_name": "Deep Stereo Geometry Network",
  "description": "**Deep Stereo Geometry Network** is a 3D object detection pipeline that relies on space transformation from 2D features to an effective 3D structure, called 3D geometric volume (3DGV). The whole neural network consists of four components. (a) A 2D image feature extractor for capturing both pixel-level and high-level features. (b) Construction of the plane-sweep volume and the 3D geometric volume. (c) Depth estimation on the plane-sweep volume. (d) 3D object detection on the 3D geometric volume.",
  "title": "DSGN: Deep Stereo Geometry Network for 3D Object Detection",
  "collection": "3D Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Temporal Distribution Characterization",
  "full_name": "Temporal Distribution Characterization",
  "description": "**Temporal Distribution Characterization**, or **TDC**, is a module used in the [AdaRNN](https://paperswithcode.com/method/adarnn) architecture to characterize the distributional information in a time series.\r\n\r\nBased on the principle of maximum entropy, maximizing the utilization of shared knowledge underlying a time series under temporal covariate shift can be done by finding periods which are most dissimilar to each other, which is also considered as the worst case of temporal covariate shift since the cross-period distributions are the most diverse. TDC achieves this goal for splitting the time series by solving an optimization problem whose objective can be formulated as:\r\n\r\n$$\r\n\\max \\_{0<K \\leq K\\_{0}} \\max \\_{n\\_{1}, \\cdots, n\\_{K}} \\frac{1}{K} \\sum_{1 \\leq i \\neq j \\leq K} d\\left(\\mathcal{D}\\_{i}, \\mathcal{D}\\_{j}\\right) \r\n$$\r\n\r\n$$\r\n\\text { s.t. } \\forall i, \\Delta_{1}<\\left|\\mathcal{D}\\_{i}\\right|<\\Delta_{2} ; \\sum_{i}\\left|\\mathcal{D}\\_{i}\\right|=n\r\n$$\r\n\r\nwhere $d$ is a distance metric, $\\Delta\\_{1}$ and $\\Delta\\_{2}$ are predefined parameters to avoid trivial solutions (e.g., very small values or very large values may fail to capture the distribution information), and $K\\_{0}$ is the hyperparameter to avoid over-splitting. The metric $d(\\cdot, \\cdot)$ above can be any distance function, e.g., Euclidean or edit distance, or some distribution-based distance / divergence, like MMD [14] and KL-divergence.\r\n\r\nThe learning goal of the optimization problem above is to maximize the averaged period-wise distribution distances by searching $K$ and the corresponding periods so that the distributions of each period are as diverse as possible and the learned prediction model has better generalization ability.",
  "title": "AdaRNN: Adaptive Learning and Forecasting of Time Series",
  "collection": "Time Series Modules",
  "area": "Sequential"
}
{
  "name": "Label Smoothing",
  "full_name": "Label Smoothing",
  "description": "**Label Smoothing** is a regularization technique that introduces noise for the labels. This accounts for the fact that datasets may have mistakes in them, so maximizing the likelihood of $\\log{p}\\left(y\\mid{x}\\right)$ directly can be harmful. Assume for a small constant $\\epsilon$, the training set label $y$ is correct with probability $1-\\epsilon$ and incorrect otherwise. Label Smoothing regularizes a model based on a [softmax](https://paperswithcode.com/method/softmax) with $k$ output values by replacing the hard $0$ and $1$ classification targets with targets of $\\frac{\\epsilon}{k-1}$ and $1-\\epsilon$ respectively.\r\n\r\nSource: Deep Learning, Goodfellow et al\r\n\r\nImage Source: [When Does Label Smoothing Help?](https://arxiv.org/abs/1906.02629)",
  "title": null,
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "SuperpixelGridMasks",
  "full_name": "SuperpixelGridCut, SuperpixelGridMean, SuperpixelGridMix",
  "description": "Karim Hammoudi, Adnane Cabani, Bouthaina Slika, Halim Benhabiles, Fadi Dornaika and Mahmoud Melkemi. SuperpixelGridCut, SuperpixelGridMean and SuperpixelGridMix Data Augmentation, arXiv:2204.08458, 2022. https://doi.org/10.48550/arxiv.2204.08458",
  "title": null,
  "collection": "Image Data Augmentation",
  "area": "Computer Vision"
}
{
  "name": "Max Pooling",
  "full_name": "Max Pooling",
  "description": "**Max Pooling** is a pooling operation that calculates the maximum value for patches of a feature map, and uses it to create a downsampled (pooled) feature map.  It is usually used after a convolutional layer. It adds a small amount of translation invariance - meaning translating the image by a small amount does not significantly affect the values of most pooled outputs.\r\n\r\nImage Source: [here](https://computersciencewiki.org/index.php/File:MaxpoolSample2.png)",
  "title": null,
  "collection": "Pooling Operations",
  "area": "Computer Vision"
}
{
  "name": "PocketNet",
  "full_name": "PocketNet",
  "description": "**PocketNet** is a face recognition model family discovered through [neural architecture search](https://paperswithcode.com/methods/category/neural-architecture-search). The training is based on multi-step knowledge distillation.",
  "title": "PocketNet: Extreme Lightweight Face Recognition Network using Neural Architecture Search and Multi-Step Knowledge Distillation",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "Spatial Attention Module (ThunderNet)",
  "full_name": "Spatial Attention Module (ThunderNet)",
  "description": "**Spatial Attention Module (SAM)** is a feature extraction module for object detection used in [ThunderNet](https://paperswithcode.com/method/thundernet).\r\n\r\nThe ThunderNet SAM explicitly re-weights the feature map before RoI warping over the spatial dimensions. The key idea of SAM is to use the knowledge from [RPN](https://paperswithcode.com/method/rpn) to refine the feature distribution of the feature map. RPN is trained to recognize foreground regions under the supervision of ground truths. Therefore, the intermediate features in RPN can be used to distinguish foreground features from background features. SAM accepts two inputs: the intermediate feature map from RPN $\\mathcal{F}^{RPN}$ and the thin feature map from the [Context Enhancement Module](https://paperswithcode.com/method/context-enhancement-module) $\\mathcal{F}^{CEM}$. The output of SAM $\\mathcal{F}^{SAM}$ is defined as:\r\n\r\n$$ \\mathcal{F}^{SAM} = \\mathcal{F}^{CEM} * \\text{sigmoid}\\left(\\theta\\left(\\mathcal{F}^{RPN}\\right)\\right) $$\r\n\r\nHere $\\theta\\left(·\\right)$ is a dimension transformation to match the number of channels in both feature maps. The sigmoid function is used to constrain the values within $\\left[0, 1\\right]$. At last, $\\mathcal{F}^{CEM}$ is re-weighted by the generated feature map for better feature distribution. For computational efficiency, we simply apply a 1×1 [convolution](https://paperswithcode.com/method/convolution) as $\\theta\\left(·\\right)$, so the computational cost of CEM is negligible. The Figure to the right shows the structure of SAM. \r\n\r\nSAM has two functions. The first one is to refine the feature distribution by strengthening foreground features and suppressing background features. The second one is to stabilize the training of RPN as SAM enables extra gradient flow from [R-CNN](https://paperswithcode.com/method/r-cnn) subnet to RPN. As a result, RPN receives additional supervision from RCNN subnet, which helps the training of RPN.",
  "title": "ThunderNet: Towards Real-time Generic Object Detection",
  "collection": "Feature Extractors",
  "area": "Computer Vision"
}
{
  "name": "ParaNet",
  "full_name": "ParaNet",
  "description": "**ParaNet** is a non-autoregressive attention-based architecture for text-to-speech, which is fully convolutional and converts text to mel spectrogram. ParaNet distills the attention from the autoregressive text-to-spectrogram model, and iteratively refines the alignment between text and spectrogram in a layer-by-layer manner. The architecture is otherwise similar to [Deep Voice 3](https://paperswithcode.com/method/deep-voice-3) except these changes to the decoder; whereas the decoder of DV3 has multiple attention-based layers, where each layer consists of a\r\n[causal convolution](https://paperswithcode.com/method/causal-convolution) block followed by an attention block, ParaNet has a single attention block in the encoder.",
  "title": "Non-Autoregressive Neural Text-to-Speech",
  "collection": "Text-to-Speech Models",
  "area": "Audio"
}
{
  "name": "FastGCN",
  "full_name": "FastGCN",
  "description": "FastGCN is a fast improvement of the GCN model recently proposed by Kipf & Welling (2016a) for learning graph embeddings. It generalizes transductive training to an inductive manner and also addresses the memory bottleneck issue of GCN caused by recursive expansion of neighborhoods. The crucial ingredient is a sampling scheme in the reformulation of the loss and the gradient, well justified through an alternative view of graph convolutions in the form of integral transforms of embedding functions.\r\n\r\nDescription and image from: [FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling](https://arxiv.org/pdf/1801.10247.pdf)",
  "title": "FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "GMVAE",
  "full_name": "Gaussian Mixture Variational Autoencoder",
  "description": "**GMVAE**, or **Gaussian Mixture Variational Autoencoder**, is a stochastic regularization layer for [transformers](https://paperswithcode.com/methods/category/transformers). A GMVAE layer is trained using a 700-dimensional internal representation of the first MLP layer. For every output from the first MLP layer, the GMVAE layer first computes a latent low-dimensional representation sampling from the GMVAE posterior distribution to then provide at the output a reconstruction sampled from a generative model.",
  "title": "Regularizing Transformers With Deep Probabilistic Layers",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "ENet",
  "full_name": "ENet",
  "description": "**ENet** is a semantic segmentation architecture which utilises a compact encoder-decoder architecture. Some design choices include:\r\n\r\n1. Using the [SegNet](https://paperswithcode.com/method/segnet) approach to downsampling by saving indices of elements chosen in max pooling layers, and using them to produce sparse upsampled maps in the decoder.\r\n2. Early downsampling to optimize the early stages of the network and reduce the cost of processing large input frames. The first two blocks of ENet heavily reduce the input size, and use only a small set of feature maps.\r\n3. Using PReLUs as an activation function\r\n4. Using dilated convolutions\r\n5. Using Spatial [Dropout](https://paperswithcode.com/method/dropout)",
  "title": "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "MPRNet",
  "full_name": "MPRNet",
  "description": "**MPRNet** is a multi-stage progressive image restoration architecture that progressively learns restoration functions for the degraded inputs, thereby breaking down the overall recovery process into more manageable steps. Specifically, the model first learns the contextualized features using encoder-decoder architectures and later combines them with a high-resolution branch that retains local information. At each stage, a per-pixel adaptive design is introduced that leverages in-situ supervised attention to reweight the local features.",
  "title": "Multi-Stage Progressive Image Restoration",
  "collection": "Image Restoration Models",
  "area": "Computer Vision"
}
{
  "name": "Slot Attention",
  "full_name": "Slot Attention",
  "description": "**Slot Attention** is an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. Using an iterative attention mechanism, Slot Attention produces a set of output vectors with permutation symmetry. Unlike capsules used in Capsule Networks, slots produced by Slot Attention do not specialize to one particular type or class of object, which could harm generalization. Instead, they act akin to object files, i.e., slots use a common representational format: each slot can store (and bind to) any object in the input. This allows Slot Attention to generalize in a systematic way to unseen compositions, more objects, and more slots.",
  "title": "Object-Centric Learning with Slot Attention",
  "collection": "Attention Modules",
  "area": "General"
}
{
  "name": "CSPDenseNet",
  "full_name": "CSPDenseNet",
  "description": "**CSPDenseNet** is a convolutional neural network and object detection backbone where we apply the Cross Stage Partial Network (CSPNet) approach to [DenseNet](https://paperswithcode.com/method/densenet). The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.",
  "title": "CSPNet: A New Backbone that can Enhance Learning Capability of CNN",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "DFDNet",
  "full_name": "DFDNet",
  "description": "**DFDNet** is a deep face dictionary network that uses component dictionaries to guide the restoration of degraded face images. Given a low-quality (LQ) image $I\\_{d}$, DFDNet selects the dictionary features that have the most similar structure to the input. Specifically, we re-normalize the whole dictionaries via component AdaIN (termed CAdaIN) based on the input component to eliminate the distribution or style diversity. The selected dictionary features are then utilized to guide the restoration process via dictionary feature transformation.",
  "title": "Blind Face Restoration via Deep Multi-scale Component Dictionaries",
  "collection": "Face Restoration Models",
  "area": "Computer Vision"
}
{
  "name": "ScheduledDropPath",
  "full_name": "ScheduledDropPath",
  "description": "**ScheduledDropPath** is a modified version of [DropPath](https://paperswithcode.com/method/droppath). In DropPath, each path in the cell is stochastically dropped with some fixed probability during training. In ScheduledDropPath, each path in the cell is dropped out with a probability that is linearly increased over the course of training.",
  "title": "Learning Transferable Architectures for Scalable Image Recognition",
  "collection": "Regularization",
  "area": "General"
}
{
  "name": "Routing Transformer",
  "full_name": "Routing Transformer",
  "description": "The **Routing Transformer** is a [Transformer](https://paperswithcode.com/method/transformer) that endows self-attention with a sparse routing module based on online k-means. Each attention module considers a clustering of the space: the current timestep only attends to context belonging to the same cluster. In other words, the current timestep's query is routed to a limited number of context elements through its cluster assignment.",
  "title": "Efficient Content-Based Sparse Attention with Routing Transformers",
  "collection": "Transformers",
  "area": "Natural Language Processing"
}
{
  "name": "Playstyle Distance",
  "full_name": "Playstyle Distance",
  "description": "This method proposes first discretizing observations and calculating the action distribution distance under comparable cases (intersection states).",
  "title": "An Unsupervised Video Game Playstyle Metric via State Discretization",
  "collection": "State Similarity Metrics",
  "area": "Reinforcement Learning"
}
{
  "name": "Sarsa",
  "full_name": "Sarsa",
  "description": "**Sarsa** is an on-policy TD control algorithm:\r\n\r\n$$Q\\left(S\\_{t}, A\\_{t}\\right) \\leftarrow Q\\left(S\\_{t}, A\\_{t}\\right) + \\alpha\\left[R_{t+1} + \\gamma{Q}\\left(S\\_{t+1}, A\\_{t+1}\\right) - Q\\left(S\\_{t}, A\\_{t}\\right)\\right] $$\r\n\r\nThis update is done after every transition from a nonterminal state $S\\_{t}$. If $S\\_{t+1}$ is terminal, then $Q\\left(S\\_{t+1}, A\\_{t+1}\\right)$ is defined as zero.\r\n\r\nTo design an on-policy control algorithm using Sarsa, we estimate $q\\_{\\pi}$ for a behaviour policy $\\pi$ and then change $\\pi$ towards greediness with respect to $q\\_{\\pi}$.\r\n\r\nSource: Sutton and Barto, Reinforcement Learning, 2nd Edition",
  "title": null,
  "collection": "On-Policy TD Control",
  "area": "Reinforcement Learning"
}
{
  "name": "Retrace",
  "full_name": "Retrace",
  "description": "**Retrace** is an off-policy Q-value estimation algorithm which has guaranteed convergence for a target and behaviour policy $\\left(\\pi, \\beta\\right)$. With off-policy rollout for TD learning, we must use importance sampling for the update:\r\n\r\n$$ \\Delta{Q}^{\\text{imp}}\\left(S\\_{t}, A\\_{t}\\right) = \\gamma^{t}\\prod\\_{1\\leq{\\tau}\\leq{t}}\\frac{\\pi\\left(A\\_{\\tau}\\mid{S\\_{\\tau}}\\right)}{\\beta\\left(A\\_{\\tau}\\mid{S\\_{\\tau}}\\right)}\\delta\\_{t} $$\r\n\r\nThis product term can lead to high variance, so Retrace modifies $\\Delta{Q}$ so that each importance weight is truncated at a constant $c$:\r\n\r\n$$ \\Delta{Q}^{\\text{ret}}\\left(S\\_{t}, A\\_{t}\\right) = \\gamma^{t}\\prod\\_{1\\leq{\\tau}\\leq{t}}\\min\\left(c, \\frac{\\pi\\left(A\\_{\\tau}\\mid{S\\_{\\tau}}\\right)}{\\beta\\left(A\\_{\\tau}\\mid{S\\_{\\tau}}\\right)}\\right)\\delta\\_{t} $$",
  "title": "Safe and Efficient Off-Policy Reinforcement Learning",
  "collection": "Value Function Estimation",
  "area": "Reinforcement Learning"
}
{
  "name": "CoVA",
  "full_name": "Context-aware Visual Attention-based (CoVA) webpage object detection pipeline",
  "description": "Context-Aware Visual Attention-based end-to-end pipeline for Webpage Object Detection (_CoVA_) aims to learn function _f_ to predict labels _y = [$y_1, y_2, ..., y_N$]_ for a webpage containing _N_ elements. The input to CoVA consists of:\r\n1. a screenshot of a webpage,\r\n2. list of bounding boxes _[x, y, w, h]_ of the web elements, and\r\n3. neighborhood information for each element obtained from the DOM tree.\r\n\r\nThis information is processed in four stages:\r\n1. the graph representation extraction for the webpage,\r\n2. the Representation Network (_RN_),\r\n3. the Graph Attention Network (_GAT_), and\r\n4. a fully connected (_FC_) layer.\r\n\r\nThe graph representation extraction computes for every web element _i_ its set of _K_ neighboring web elements _$N_i$_. The _RN_ consists of a Convolutional Neural Net (_CNN_) and a positional encoder aimed to learn a visual representation _$v_i$_ for each web element _i &isin; {1, ..., N}_. The _GAT_ combines the visual representation _$v_i$_ of the web element _i_ to be classified and those of its neighbors, i.e., _$v_k$ &forall;k &isin; $N_i$_ to compute the contextual representation _$c_i$_ for web element _i_. Finally, the visual and contextual representations of the web element are concatenated and passed through the _FC_ layer to obtain the classification output.",
  "title": "CoVA: Context-aware Visual Attention for Webpage Information Extraction",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "DimConv",
  "full_name": "Dimension-wise Convolution",
  "description": "A **Dimension-wise Convolution**, or **DimConv**, is a type of [convolution](https://paperswithcode.com/method/convolution) that can encode depth-wise, width-wise, and height-wise information independently. To achieve this, DimConv extends depthwise convolutions to all dimensions of the input tensor $X \\in \\mathbb{R}^{D\\times{H}\\times{W}}$, where $W$, $H$, and $D$ correspond to the width, height, and depth of $X$. DimConv has three branches, one branch per dimension. These branches apply $D$ depth-wise convolutional kernels $k\\_{D} \\in \\mathbb{R}^{1\\times{n}\\times{n}}$ along depth, $W$ width-wise convolutional kernels $k\\_{W} \\in \\mathbb{R}^{n\\times{n}\\times{1}}$ along width, and $H$ height-wise convolutional kernels $k\\_{H} \\in \\mathbb{R}^{n\\times{1}\\times{n}}$ along height to produce outputs $Y\\_{D}$, $Y\\_{W}$, and $Y\\_{H} \\in \\mathbb{R}^{D\\times{H}\\times{W}}$ that encode information from all dimensions of the input tensor. The outputs of these independent branches are concatenated along the depth dimension, such that the first spatial plane of $Y\\_{D}$, $Y\\_{W}$, and $Y\\_{H}$ are put together and so on, to produce the output $Y\\_{Dim} = ${$Y\\_{D}$, $Y\\_{W}$, $Y\\_{H}$} $\\in \\mathbb{R}^{3D\\times{H}\\times{W}}$.",
  "title": "DiCENet: Dimension-wise Convolutions for Efficient Networks",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "SyncBN",
  "full_name": "Synchronized Batch Normalization",
  "description": "**Synchronized Batch Normalization (SyncBN)** is a type of [batch normalization](https://paperswithcode.com/method/batch-normalization) used for multi-GPU training. Standard batch normalization only normalizes the data within each device (GPU). SyncBN normalizes the input within the whole mini-batch.",
  "title": "Context Encoding for Semantic Segmentation",
  "collection": "Normalization",
  "area": "General"
}
{
  "name": "Libra R-CNN",
  "full_name": "Libra R-CNN",
  "description": "**Libra R-CNN** is an object detection model that seeks to achieve a balanced training procedure. The authors' motivation is that training in past detectors has suffered from imbalance, which generally occurs at three levels: sample level, feature level, and objective level. To mitigate the adverse effects, Libra R-CNN integrates three novel components: IoU-balanced sampling, [balanced feature pyramid](https://paperswithcode.com/method/balanced-feature-pyramid), and [balanced L1 loss](https://paperswithcode.com/method/balanced-l1-loss), respectively for reducing the imbalance at sample, feature, and objective level.",
  "title": "Libra R-CNN: Towards Balanced Learning for Object Detection",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Harris Hawks optimization (HHO)",
  "full_name": "Harris Hawks optimization",
  "description": "[HHO](https://aliasgharheidari.com/HHO.html) is a swarm-based, gradient-free optimization algorithm with several active and time-varying phases of exploration and exploitation. The algorithm was initially published in the journal Future Generation Computer Systems (FGCS) in 2019 and has since gained increasing attention among researchers. The main logic of the HHO method is based on the cooperative behaviour and chasing style of Harris' hawks in nature, called \"surprise pounce\". Many enhancements to HHO have been suggested, and several enhanced variants have been published in Elsevier and IEEE Transactions journals.\r\n\r\nFrom the algorithmic behaviour viewpoint, there are several effective features in HHO:\r\nThe escaping energy parameter has a dynamic, randomized, time-varying nature, which can further improve and harmonize the exploratory and exploitative patterns of HHO. This factor also helps HHO to make a smooth transition between exploration and exploitation.\r\nDifferent exploration mechanisms with respect to the average location of hawks can increase the exploratory trends of HHO throughout the initial iterations.\r\nDiverse LF-based patterns with short-length jumps enrich the exploitative behaviours of HHO when directing a local search.\r\nThe progressive selection scheme helps search agents to progressively advance their position and only select a better position, which can improve the quality of solutions and the intensification power of HHO throughout the optimization procedure.\r\nHHO evaluates a series of searching strategies and then selects the best movement step. This feature also has a constructive influence on the exploitation inclinations of HHO.\r\nThe randomized jump strength can assist candidate solutions in harmonising the exploration and exploitation leanings.\r\nThe application of adaptive and time-varying components allows HHO to handle difficulties of a feature space including local optimal solutions, multi-modality, and deceptive optima.\r\n\r\nThe source code of HHO is publicly available at https://aliasgharheidari.com/HHO.html",
  "title": null,
  "collection": "Optimization",
  "area": "General"
}
{
  "name": "NormLinComb",
  "full_name": "Normalized Linear Combination of Activations",
  "description": "The **Normalized Linear Combination of Activations**, or **NormLinComb**, is a type of activation function that has trainable parameters and uses the normalized linear combination of other activation functions.\r\n\r\n$$NormLinComb(x) = \\frac{\\sum\\limits_{i=0}^{n} w_i \\mathcal{F}_i(x)}{\\mid \\mid W \\mid \\mid}$$",
  "title": "Trainable Activations for Image Classification",
  "collection": "Activation Functions",
  "area": "General"
}
{
  "name": "AdaSqrt",
  "full_name": "AdaSqrt",
  "description": "**AdaSqrt** is a stochastic optimization technique that is motivated by the observation that methods like [Adagrad](https://paperswithcode.com/method/adagrad) and [Adam](https://paperswithcode.com/method/adam) can be viewed as relaxations of [Natural Gradient Descent](https://paperswithcode.com/method/natural-gradient-descent).\r\n\r\nThe updates are performed as follows:\r\n\r\n$$ t \\leftarrow t + 1 $$\r\n\r\n$$ \\alpha\\_{t} \\leftarrow \\sqrt{t} $$\r\n\r\n$$ g\\_{t} \\leftarrow \\nabla\\_{\\theta}f\\left(\\theta\\_{t-1}\\right) $$\r\n\r\n$$ S\\_{t} \\leftarrow S\\_{t-1} + g\\_{t}^{2} $$\r\n\r\n$$ \\theta\\_{t+1} \\leftarrow \\theta\\_{t} + \\eta\\frac{\\alpha\\_{t}g\\_{t}}{S\\_{t} + \\epsilon} $$",
  "title": "Second-order Information in First-order Optimization Methods",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "IoU-Balanced Sampling",
  "full_name": "IoU-Balanced Sampling",
  "description": "**IoU-Balanced Sampling** is hard mining method for object detection. Suppose we need to sample $N$ negative samples from $M$ corresponding candidates. The selected probability for each sample under random sampling is:\r\n\r\n$$ p = \\frac{N}{M} $$\r\n\r\nTo raise the selected probability of hard negatives, we evenly split the sampling interval into $K$ bins according to IoU. $N$ demanded negative samples are equally distributed to each bin. Then we select samples from them uniformly. Therefore, we get the selected probability under IoU-balanced sampling:\r\n\r\n$$ p\\_{k} = \\frac{N}{K}*\\frac{1}{M\\_{k}}\\text{ , } k\\in\\left[0, K\\right)$$\r\n\r\nwhere $M\\_{k}$ is the number of sampling candidates in the corresponding interval denoted by $k$. $K$ is set to 3 by default in our experiments.\r\n\r\nThe sampled histogram with IoU-balanced sampling is shown by green color in the Figure to the right. The IoU-balanced sampling can guide the distribution of training samples close to the one of hard negatives.",
  "title": "Libra R-CNN: Towards Balanced Learning for Object Detection",
  "collection": "Prioritized Sampling",
  "area": "General"
}
{
  "name": "NVAE Generative Residual Cell",
  "full_name": "NVAE Generative Residual Cell",
  "description": "The **NVAE Generative Residual Cell** is a skip connection block used as part of the [NVAE](https://paperswithcode.com/method/nvae) architecture for the generator. The residual cell expands the number of channels $E$ times before applying the [depthwise separable convolution](https://paperswithcode.com/method/depthwise-separable-convolution), and then maps it back to $C$ channels. The design motivation was to help model long-range correlations in the data by increasing the receptive field of the network, which explains the expanding path but also the use of depthwise convolutions to keep a handle on parameter count.",
  "title": "NVAE: A Deep Hierarchical Variational Autoencoder",
  "collection": "Image Model Blocks",
  "area": "Computer Vision"
}
{
  "name": "IFNet",
  "full_name": "IFNet",
  "description": "**IFNet** is an architecture for video frame interpolation that adopts a coarse-to-fine strategy with progressively increased resolutions: it iteratively updates intermediate flows and soft fusion mask via successive [IFBlocks](https://paperswithcode.com/method/ifblock). Conceptually, according to the iteratively updated flow fields, we can move corresponding pixels from two input frames to the same location in a latent intermediate frame and use a fusion mask to combine pixels from two input frames. Unlike most previous optical flow models, IFBlocks do not contain expensive operators like cost volume or forward warping and use 3 × 3 [convolution](https://paperswithcode.com/method/convolution) and deconvolution as building blocks.",
  "title": "RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation",
  "collection": "Video Frame Interpolation",
  "area": "Computer Vision"
}
{
  "name": "FoveaBox",
  "full_name": "FoveaBox",
  "description": "**FoveaBox** is anchor-free framework for object detection. Instead of using predefined anchors to enumerate possible locations, scales and aspect ratios for the search of the objects, FoveaBox directly learns the object existing possibility and the bounding box coordinates without anchor reference. This is achieved by: (a) predicting category-sensitive semantic maps for the object existing possibility, and (b) producing category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations for each input image\r\n\r\nIt is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs per pixel classification on the backbone’s output; the second subnet performs bounding box prediction for the corresponding\r\nposition.",
  "title": "FoveaBox: Beyond Anchor-based Object Detector",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "CBoW Word2Vec",
  "full_name": "Continuous Bag-of-Words Word2Vec",
  "description": "**Continuous Bag-of-Words Word2Vec** is an architecture for creating word embeddings that uses $n$ future words as well as $n$ past words to create a word embedding. The objective function for CBOW is:\r\n\r\n$$ J\\_\\theta = \\frac{1}{T}\\sum^{T}\\_{t=1}\\log{p}\\left(w\\_{t}\\mid{w}\\_{t-n},\\ldots,w\\_{t-1}, w\\_{t+1},\\ldots,w\\_{t+n}\\right) $$\r\n\r\nIn the CBOW model, the distributed representations of context are used to predict the word in the middle of the window. This contrasts with [Skip-gram Word2Vec](https://paperswithcode.com/method/skip-gram-word2vec) where the distributed representation of the input word is used to predict the context.",
  "title": "Efficient Estimation of Word Representations in Vector Space",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "ViP-DeepLab",
  "full_name": "ViP-DeepLab",
  "description": "**ViP-DeepLab** is a model for depth-aware video panoptic segmentation. It extends Panoptic-[DeepLab](https://paperswithcode.com/method/deeplab) by adding a depth prediction head to perform monocular depth estimation and a next-frame instance branch which regresses to the object centers in frame $t$ for frame $t + 1$.  This allows the model to jointly perform video panoptic segmentation and monocular depth estimation.",
  "title": "ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation",
  "collection": "Video Panoptic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "Anti-Alias Downsampling",
  "full_name": "Anti-Alias Downsampling",
  "description": "**Anti-Alias Downsampling (AA)** aims to improve the shift-equivariance of deep networks. Max-pooling is inherently composed of two operations. The first operation is to densely evaluate the max operator and second operation is naive subsampling. AA is proposed as a low-pass filter between them to achieve practical anti-aliasing in any existing strided layer such as strided [convolution](https://paperswithcode.com/method/convolution). The smoothing factor can be adjusted by changing the blur kernel filter size, where a larger filter size results in increased blur.",
  "title": "Making Convolutional Networks Shift-Invariant Again",
  "collection": "Downsampling",
  "area": "Computer Vision"
}
{
  "name": "Gradient-Based Subword Tokenization",
  "full_name": "GBST",
  "description": "**GBST**, or **Gradient-based Subword Tokenization Module**, is a soft gradient-based subword tokenization module that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network.  \r\n\r\nGBST learns a position-wise soft selection over candidate subword blocks by scoring them with a scoring network. In contrast to prior tokenization-free methods, GBST learns interpretable latent subwords, which enables easy inspection of lexical representations and is more efficient than other byte-based models.",
  "title": "Charformer: Fast Character Transformers via Gradient-based Subword Tokenization",
  "collection": "Subword Segmentation",
  "area": "Natural Language Processing"
}
{
  "name": "SCNN_UNet_ConvLSTM",
  "full_name": "Spatial CNN with UNet based Encoder-decoder and ConvLSTM",
  "description": "Spatial CNN with UNet based Encoder-decoder and ConvLSTM",
  "title": "A Hybrid Spatial-temporal Deep Learning Architecture for Lane Detection",
  "collection": "Image Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "EfficientUNet++",
  "full_name": "EfficientUNet++",
  "description": "Decoder architecture inspired on the [UNet++](https://paperswithcode.com/method/unet) structure and the [EfficientNet](https://paperswithcode.com/method/efficientnet) building blocks. Keeping the UNet++ structure, the EfficientUNet++ achieves higher performance and significantly lower computational complexity through two simple modifications:\r\n\r\n* Replaces the 3x3 convolutions of the UNet++ with residual bottleneck blocks with depthwise convolutions\r\n* Applies channel and spatial attention to the bottleneck feature maps using [concurrent spatial and channel squeeze & excitation (scSE)](https://paperswithcode.com/method/scse) blocks",
  "title": "Encoder-Decoder Architectures for Clinically Relevant Coronary Artery Segmentation",
  "collection": "Semantic Segmentation Models",
  "area": "Computer Vision"
}
{
  "name": "AlphaFold",
  "full_name": "AlphaFold",
  "description": "AlphaFold is a deep learning based algorithm for accurate protein structure prediction. AlphaFold incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.\r\n\r\nDescription from: [Highly accurate protein structure prediction with AlphaFold](https://paperswithcode.com/paper/highly-accurate-protein-structure-prediction)\r\n\r\nImage credit: [DeepMind](https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology)",
  "title": "Highly accurate protein structure prediction with AlphaFold",
  "collection": null,
  "area": null
}
{
  "name": "Jukebox",
  "full_name": "Jukebox",
  "description": "**Jukebox** is a model that generates music with singing in the raw audio domain. It tackles the long context of raw audio using a multi-scale [VQ-VAE](https://paperswithcode.com/method/vq-vae) to compress it to discrete codes, and modeling those using [autoregressive Transformers](https://paperswithcode.com/methods/category/autoregressive-transformers). It can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.\r\n\r\nThree separate VQ-VAE models are trained with different temporal resolutions. At each level, the input audio is segmented and encoded into latent vectors $\\mathbf{h}\\_{t}$, which are then quantized to the closest codebook vectors $\\mathbf{e}\\_{z\\_{t}}$. The code $z\\_{t}$ is a discrete representation of the audio that we later train our prior on. The decoder takes the sequence of codebook vectors and reconstructs the audio. The top level learns the highest degree of abstraction, since it is encoding longer audio per token while keeping the codebook size the same. Audio can be reconstructed using the codes at any one of the abstraction levels, where the least abstract bottom-level codes result in the highest-quality audio.",
  "title": "Jukebox: A Generative Model for Music",
  "collection": "Generative Audio Models",
  "area": "Audio"
}
{
  "name": "IMPALA",
  "full_name": "IMPALA",
  "description": "**IMPALA**, or the **Importance Weighted Actor Learner Architecture**, is an off-policy actor-critic framework that decouples acting from learning and learns from experience trajectories using [V-trace](https://paperswithcode.com/method/v-trace). Unlike the popular [A3C](https://paperswithcode.com/method/a3c)-based agents, in which workers communicate gradients with respect to the parameters of the policy to a central parameter server, IMPALA actors communicate trajectories of experience (sequences of states, actions, and rewards) to a centralized learner. Since the learner in IMPALA has access to full trajectories of experience we use a GPU to perform updates on mini-batches of trajectories while aggressively parallelising all time independent operations. \r\n\r\nThis type of decoupled architecture can achieve very high throughput. However, because the policy used to generate a trajectory can lag behind the policy on the learner by several updates at the time of gradient calculation, learning becomes off-policy. The V-trace off-policy actor-critic algorithm is used to correct for this harmful discrepancy.",
  "title": "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "BatchFormer",
  "full_name": "Batch Transformer",
  "description": "learn to explore the sample relationships via transformer networks",
  "title": "BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "ComplEx-N3-RP",
  "full_name": "ComplEx with N3 Regularizer and Relation Prediction Objective",
  "description": "ComplEx model trained with a nuclear norm regularizer; A relation prediction objective is added on top of the commonly used 1vsAll objective.",
  "title": "Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations",
  "collection": "Graph Embeddings",
  "area": "Graphs"
}
{
  "name": "GAT",
  "full_name": "Graph Attention Network",
  "description": "A **Graph Attention Network (GAT)** is a neural network architecture that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods’ features, a GAT enables (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront.\r\n\r\nSee [here](https://docs.dgl.ai/en/0.4.x/tutorials/models/1_gnn/9_gat.html) for an explanation by DGL.",
  "title": "Graph Attention Networks",
  "collection": "Graph Models",
  "area": "Graphs"
}
{
  "name": "VGG Loss",
  "full_name": "VGG Loss",
  "description": "**VGG Loss** is a type of content loss introduced in the [Perceptual Losses for Real-Time Style Transfer and Super-Resolution](https://paperswithcode.com/paper/perceptual-losses-for-real-time-style) super-resolution and style transfer framework. It is an alternative to pixel-wise losses; VGG Loss attempts to be closer to perceptual similarity. The [VGG](https://paperswithcode.com/method/vgg) loss is based on the [ReLU](https://paperswithcode.com/method/relu) activation layers of the pre-trained 19 layer VGG network. With $\\phi\\_{i,j}$ we indicate the feature map obtained by the $j$-th [convolution](https://paperswithcode.com/method/convolution) (after activation) before the $i$-th maxpooling layer within the VGG19 network, which we consider given. We then define the VGG loss as the euclidean distance between the feature representations of a reconstructed image $G\\_{\\theta\\_{G}}\\left(I^{LR}\\right)$ and the reference image $I^{HR}$:\r\n\r\n$$ l\\_{VGG/i.j} = \\frac{1}{W\\_{i,j}H\\_{i,j}}\\sum\\_{x=1}^{W\\_{i,j}}\\sum\\_{y=1}^{H\\_{i,j}}\\left(\\phi\\_{i,j}\\left(I^{HR}\\right)\\_{x, y} - \\phi\\_{i,j}\\left(G\\_{\\theta\\_{G}}\\left(I^{LR}\\right)\\right)\\_{x, y}\\right)^{2}$$ \r\n\r\nHere $W\\_{i,j}$ and $H\\_{i,j}$ describe the dimensions of the respective feature maps within the VGG network.",
  "title": "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network",
  "collection": "Loss Functions",
  "area": "General"
}
{
  "name": "Local SGD",
  "full_name": "Local SGD",
  "description": "**Local SGD** is a distributed training technique that runs [SGD](https://paperswithcode.com/method/sgd) independently in parallel on different workers and averages the sequences only once in a while.",
  "title": "Local SGD Converges Fast and Communicates Little",
  "collection": "Stochastic Optimization",
  "area": "General"
}
{
  "name": "Attention Mesh",
  "full_name": "Attention Mesh",
  "description": "**Attention Mesh** is a neural network architecture for 3D face mesh prediction that uses attention to semantically meaningful regions. Specifically region-specific heads are employed that transform the feature maps with spatial transformers.",
  "title": "Attention Mesh: High-fidelity Face Mesh Prediction in Real-time",
  "collection": "3D Face Mesh Models",
  "area": "Computer Vision"
}
{
  "name": "Differentiable NAS",
  "full_name": "Differentiable Neural Architecture Search",
  "description": "",
  "title": "DARTS: Differentiable Architecture Search",
  "collection": "Neural Architecture Search",
  "area": "General"
}
{
  "name": "MDPO",
  "full_name": "Mirror Descent Policy Optimization",
  "description": "**Mirror Descent Policy Optimization (MDPO)** is a policy gradient algorithm based on the idea of iteratively solving a trust-region problem that minimizes a sum of two terms: a linearization of the standard RL objective function and a proximity term that restricts two consecutive updates to be close to each other. It is based on Mirror Descent, which is a general trust region method that\r\nattempts to keep consecutive iterates close to each other.",
  "title": "Mirror Descent Policy Optimization",
  "collection": "Policy Gradient Methods",
  "area": "Reinforcement Learning"
}
{
  "name": "Gumbel Softmax",
  "full_name": "Gumbel Softmax",
  "description": "**Gumbel-Softmax** is a continuous distribution that has the property that it can be smoothly annealed into a categorical distribution, and whose parameter gradients can be easily computed via the reparameterization trick.",
  "title": "Categorical Reparameterization with Gumbel-Softmax",
  "collection": "Distributions",
  "area": "General"
}
{
  "name": "LPM",
  "full_name": "Local Prior Matching",
  "description": "**Local Prior Matching** is a semi-supervised objective for speech recognition that distills knowledge from a strong prior (e.g. a language model) to provide learning signal to a discriminative model trained on unlabeled speech. The LPM objective minimizes the cross entropy between the local prior and the model distribution, and is minimized when $q\\_{y\\mid{x}} = \\bar{p}\\_{y\\mid{x}}$. Intuitively, LPM encourages the ASR model to assign posterior probabilities proportional to the linguistic probabilities of the proposed hypotheses.",
  "title": "Semi-Supervised Speech Recognition via Local Prior Matching",
  "collection": "Semi-Supervised Learning Methods",
  "area": "General"
}
{
  "name": "Spatial & Temporal Attention",
  "full_name": "Spatial & Temporal Attention",
  "description": "Spatial & temporal attention combines the advantages of spatial attention and temporal attention as it adaptively selects both important regions and key frames. Some works compute temporal attention and spatial attention separately, while others produce joint spatio & temporal attention maps. Further works focusing on capturing pairwise relations.",
  "title": "An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data",
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "PP-YOLOv2",
  "full_name": "PP-YOLOv2",
  "description": "**PP-YOLOv2** is an object detector that extends upon [PP-YOLO](https://www.paperswithcode.com/method/pp-yolo) with several refinements:\r\n\r\n- A [Path Aggregation Network](https://paperswithcode.com/method/pafpn) is included for the FPN to compose bottom-up paths.\r\n- [Mish Activation functions](https://paperswithcode.com/method/mish) are used.\r\n- The input size is expanded.\r\n- An IoU aware branch is calculated with a soft label format.",
  "title": "PP-YOLOv2: A Practical Object Detector",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "Energy Based Process",
  "full_name": "Energy Based Process",
  "description": "**Energy Based Processes** extend energy based models to exchangeable data while allowing neural network parameterizations of the energy function. They extend the previously separate stochastic process and latent variable model perspectives in a common framework. The result is a generalization of [Gaussian processes](https://paperswithcode.com/method/gaussian-process) and Student-t processes that exploits EBMs for greater flexibility.",
  "title": "Energy-Based Processes for Exchangeable Data",
  "collection": "Non-Parametric Regression",
  "area": "General"
}
{
  "name": "Dense Connections",
  "full_name": "Dense Connections",
  "description": "**Dense Connections**, or **Fully Connected Connections**, are a type of layer in a deep neural network that use a linear operation where every input is connected to every output by a weight. This means there are $n\\_{\\text{inputs}}*n\\_{\\text{outputs}}$ parameters, which can lead to a lot of parameters for a sizeable network.\r\n\r\n$$h\\_{l} = g\\left(\\textbf{W}^{T}h\\_{l-1}\\right)$$\r\n\r\nwhere $g$ is an activation function.\r\n\r\nImage Source: Deep Learning by Goodfellow, Bengio and Courville",
  "title": null,
  "collection": "Feedforward Networks",
  "area": "General"
}
{
  "name": "SANet",
  "full_name": "Self-Attention Network",
  "description": "**Self-Attention Network** (**SANet**) proposes two variations of self-attention used for image recognition: 1) pairwise self-attention which generalizes standard [dot-product attention](https://paperswithcode.com/method/dot-product-attention) and is fundamentally a set operator, and 2) patchwise self-attention which is strictly more powerful than [convolution](https://paperswithcode.com/method/convolution).",
  "title": "Exploring Self-attention for Image Recognition",
  "collection": "Image Models",
  "area": "Computer Vision"
}
{
  "name": "CurricularFace",
  "full_name": "CurricularFace",
  "description": "**CurricularFace**, or **Adaptive Curriculum Learning**, is a method for face recognition that embeds the idea of curriculum learning into the loss function to achieve a new training scheme. This training scheme mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, CurricularFace adaptively adjusts the relative importance of easy and hard samples during different training stages.",
  "title": "CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition",
  "collection": "Face Recognition Models",
  "area": "Computer Vision"
}
{
  "name": "MARLIN",
  "full_name": "MARLIN",
  "description": "",
  "title": "MARLIN: Masked Autoencoder for facial video Representation LearnINg",
  "collection": "Self-Supervised Learning",
  "area": "General"
}
{
  "name": "PVTv2",
  "full_name": "Pyramid Vision Transformer v2",
  "description": "**Pyramid Vision Transformer v2** (PVTv2) is a type of [Vision Transformer](https://paperswithcode.com/method/vision-transformer) for detection and segmentation tasks. It improves on [PVTv1](https://paperswithcode.com/method/pvt) through several design improvements: (1) overlapping patch embedding, (2) convolutional feed-forward networks, and (3) linear complexity attention layers that are orthogonal to the PVTv1 framework.",
  "title": "PVT v2: Improved Baselines with Pyramid Vision Transformer",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "SAG",
  "full_name": "Self-Attention Guidance",
  "description": "",
  "title": "Improving Sample Quality of Diffusion Models Using Self-Attention Guidance",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "DMVFN",
  "full_name": "A Dynamic Multi-Scale Voxel Flow Network",
  "description": "",
  "title": "A Dynamic Multi-Scale Voxel Flow Network for Video Prediction",
  "collection": "Structured Prediction",
  "area": "General"
}
{
  "name": "Hierarchical VAE",
  "full_name": "Hierarchical Variational Autoencoder",
  "description": "",
  "title": "Ladder Variational Autoencoders",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
{
  "name": "VLMo",
  "full_name": "Vision-Language pretrained Model",
  "description": "VLMo is a unified vision-language pre-trained model that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. A Mixture-of-Modality-Experts (MOME) transformer is introduced to encode different modalities which helps it to capture modality-specific information by modality experts, and align content of different modalities by the self-attention module shared across modalities. The model parameters are shared across image-text contrastive learning, masked language modeling, and image-text matching tasks. During fine-tuning, the flexible modeling allows for VLMO to be used as either a dual encoder (i.e., separately encode images and text for retrieval tasks) or a fusion encoder (i.e., jointly encode image-text pairs for better interaction across modalities) Stage-wise pretraining on image-only and text-only data improved the vision-language pre-trained model. The model can be used for classification tasks and fine-tuned as a dual encoder for retrieval tasks.",
  "title": "VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts",
  "collection": "Vision and Language Pre-Trained Models",
  "area": "Computer Vision"
}
{
  "name": "GCNet",
  "full_name": "GCNet",
  "description": "A **Global Context Network**, or **GCNet**, utilises global context blocks to model long-range dependencies in images. It is based on the [Non-Local Network](https://paperswithcode.com/method/non-local-block), but it modifies the architecture so less computation is required. Global context blocks are applied to multiple layers in a backbone network to construct the GCNet.",
  "title": "GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond",
  "collection": "Object Detection Models",
  "area": "Computer Vision"
}
{
  "name": "GRANDE",
  "full_name": "Gradient-Based Decision Tree Ensembles",
  "description": "",
  "title": "GRANDE: Gradient-Based Decision Tree Ensembles",
  "collection": "Deep Tabular Learning",
  "area": "General"
}
{
  "name": "MoCo v3",
  "full_name": "MoCo v3",
  "description": "**MoCo v3** aims to stabilize training of self-supervised ViTs. MoCo v3 is an incremental improvement of MoCo v1/2. Two crops are used for each image under random data augmentation. They are encoded by two encoders $f_q$ and $f_k$ with output vectors $q$ and $k$. $q$ behaves like a \"query\", where the goal of learning is to retrieve the corresponding \"key\". The objective is to minimize a contrastive loss function of the following form: \r\n\r\n$$\r\n\\mathcal{L_q}=-\\log \\frac{\\exp \\left(q \\cdot k^{+} / \\tau\\right)}{\\exp \\left(q \\cdot k^{+} / \\tau\\right)+\\sum_{k^{-}} \\exp \\left(q \\cdot k^{-} / \\tau\\right)}\r\n$$\r\n\r\nThis approach aims to train the Transformer in the contrastive/Siamese paradigm. The encoder $f_q$ consists of a backbone (e.g., ResNet and ViT), a projection head, and an extra prediction head. The encoder $f_k$ has the back the backbone and projection head but not the prediction head. $f_k$ is updated by the moving average of $f_q$, excluding the prediction head.",
  "title": "An Empirical Study of Training Self-Supervised Vision Transformers",
  "collection": "Vision Transformers",
  "area": "Computer Vision"
}
{
  "name": "Poincaré Embeddings",
  "full_name": "Poincaré Embeddings",
  "description": "**Poincaré Embeddings** learn hierarchical representations of symbolic data by embedding them into hyperbolic space -- or more precisely into an $n$-dimensional Poincaré ball. Due to the underlying hyperbolic geometry, this allows for learning of parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. Embeddings are learnt based on\r\nRiemannian optimization.",
  "title": "Poincaré Embeddings for Learning Hierarchical Representations",
  "collection": "Word Embeddings",
  "area": "Natural Language Processing"
}
{
  "name": "Channel-wise Soft Attention",
  "full_name": "Channel-wise Soft Attention",
  "description": "**Channel-wise Soft Attention** is an attention mechanism in computer vision that assigns \"soft\" attention weights for each channel $c$. In soft channel-wise attention, the alignment weights are learned and placed \"softly\" over each channel. This would contrast with hard attention which would only selects one channel to attend to at a time.\r\n\r\nImage: [Xu et al](http://proceedings.mlr.press/v37/xuc15.pdf)",
  "title": null,
  "collection": "Attention Mechanisms",
  "area": "General"
}
{
  "name": "DetNet",
  "full_name": "DetNet",
  "description": "**DetNet** is a backbone convolutional neural network for object detection. Different from traditional pre-trained models for ImageNet classification, DetNet maintains the spatial resolution of the features even though extra stages are included. DetNet attempts to stay efficient by employing a low complexity dilated bottleneck structure.",
  "title": "DetNet: A Backbone network for Object Detection",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "CKConv",
  "full_name": "Continuous Kernel Convolution",
  "description": "",
  "title": "CKConv: Continuous Kernel Convolution For Sequential Data",
  "collection": "Convolutions",
  "area": "Computer Vision"
}
{
  "name": "Pruning",
  "full_name": "Pruning",
  "description": "",
  "title": "Pruning Filters for Efficient ConvNets",
  "collection": "Model Compression",
  "area": "General"
}
{
  "name": "Low-resolution input",
  "full_name": "Low-resolution input",
  "description": "",
  "title": "EfficientPose: Scalable single-person pose estimation",
  "collection": "Image Representations",
  "area": "Computer Vision"
}
{
  "name": "DenseNAS-B",
  "full_name": "DenseNAS-B",
  "description": "**DenseNAS-B** is a mobile convolutional neural network discovered through the [DenseNAS](https://paperswithcode.com/method/densenas) [neural architecture search](https://paperswithcode.com/method/neural-architecture-search) method. The basic building block is MBConvs, or inverted bottleneck residuals, from the [MobileNet](https://paperswithcode.com/method/mobilenetv2) architectures.",
  "title": "Densely Connected Search Space for More Flexible Neural Architecture Search",
  "collection": "Convolutional Neural Networks",
  "area": "Computer Vision"
}
{
  "name": "StyleALAE",
  "full_name": "StyleALAE",
  "description": "**StyleALAE** is a type of [adversarial latent autoencoder](https://paperswithcode.com/method/alae) that uses a [StyleGAN](https://paperswithcode.com/method/stylegan) based generator. For this the latent space $\\mathcal{W}$ plays the same role as the intermediate latent space in [StyleGAN](https://paperswithcode.com/method/stylegan). Therefore, the $G$ network becomes the part of StyleGAN depicted on the right side of the Figure. The left side is a\r\nnovel architecture that we designed to be the encoder $E$. The StyleALAE encoder has [Instance Normalization](https://paperswithcode.com/method/instance-normalization) (IN) layers to extract multiscale style information that is combined into a latent code $w$ via a learnable multilinear map.",
  "title": "Adversarial Latent Autoencoders",
  "collection": "Generative Models",
  "area": "Computer Vision"
}
