User:Fgpacini/Feature learning

Diagram of the feature learning paradigm in machine learning for application to downstream tasks. Feature learning can be applied either to raw data such as images or text, or to an initial set of features for the data, and is intended to result in faster training or better performance in task-specific settings than if the data were input directly.[1]

Feature learning is typically supervised, unsupervised, or self-supervised.

Self-supervised

Self-supervised representation learning learns features by training on the structure of unlabeled data rather than relying on explicit labels as the supervisory signal. This approach has enabled the combined use of deep neural networks and large unlabeled datasets to produce deep feature representations.[4] Training tasks are typically either contrastive, generative, or a combination of the two.[7] Contrastive representation learning trains the representations of associated data pairs, called positive samples, to be aligned, while pairs with no relation, called negative samples, are contrasted (pushed apart). A large number of negative samples is typically required to prevent catastrophic collapse, in which all inputs are mapped to the same representation.[8] Generative representation learning tasks the model with producing the correct data, either to complete a restricted input or to reconstruct the full input from a lower-dimensional representation.[7]
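For illustration, a minimal sketch of a contrastive objective of this kind (an InfoNCE-style loss written in PyTorch; the batch size, embedding dimension, and temperature below are illustrative assumptions rather than values from any particular method):

import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, temperature=0.1):
    # Each anchor is pulled toward its positive pair and pushed away from
    # every other sample in the batch, which serves as a negative sample.
    anchor = F.normalize(anchor, dim=1)        # (N, D) anchor representations
    positive = F.normalize(positive, dim=1)    # (N, D) matching positive representations
    logits = anchor @ positive.T / temperature # (N, N) pairwise similarities
    targets = torch.arange(anchor.size(0))     # the diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Toy usage: 8 positive pairs of 16-dimensional representations.
loss = contrastive_loss(torch.randn(8, 16), torch.randn(8, 16))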

A common setup for self-supervised representation learning of a certain data type (e.g. text, image, audio, video) is to pretrain the model using large datasets of general-context, unlabeled data.[1] Depending on the context, the result is either a set of representations for common data segments (e.g. words) into which new data can be broken, or a neural network able to convert new data into a set of lower-dimensional features.[4] In either case, the output representations can then be used as an initialization in many different problem settings where labeled data may be limited. Specialization of the model to specific tasks is typically done with supervised learning, either by fine-tuning the model or representations with the labels as the signal, or by freezing the representations and training an additional model that takes them as input.[1]
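As a rough sketch of the second option (freezing the pretrained representations and training a small supervised model on top of them), in PyTorch; the encoder, feature size, class count, and data here are placeholders, not drawn from any particular system:

import torch
import torch.nn as nn

# Stand-in for a pretrained encoder that maps inputs to 512-dimensional features.
pretrained_encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 512))

for p in pretrained_encoder.parameters():
    p.requires_grad = False              # freeze the learned representations

classifier = nn.Linear(512, 10)          # small task-specific head trained with labels
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))   # one labeled mini-batch
with torch.no_grad():
    features = pretrained_encoder(x)     # frozen features used as input to the head
loss = nn.functional.cross_entropy(classifier(features), y)
loss.backward()
optimizer.step()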

Many self-supervised training schemes have been developed for use in representation learning of various modalities, often first showing successful application to text or images before being transferred to other data types.[4]

Text

Word2vec is a word embedding technique which learns to represent words through self-supervision over each word and its neighboring words in a sliding window across a large corpus of text.[9] The model has two possible training schemes to produce word vector representations, one generative and one contrastive.[7] The first is word prediction given each of the neighboring words as an input.[9] The second is training on the representation similarity for neighboring words and representation dissimilarity for random pairs of words.[5] A limitation of word2vec is that only the pairwise co-occurrence structure of the data is used, and not the ordering or the entire set of context words. More recent transformer-based representation learning approaches attempt to address this with word prediction tasks.[4] GPT pretrains on next-word prediction using prior input words as context,[10] whereas BERT masks random tokens in order to provide bidirectional context.[11]
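A minimal sketch of the contrastive scheme (skip-gram with negative sampling) in PyTorch; the vocabulary size, embedding dimension, and word indices below are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 1000, 50
word_emb = nn.Embedding(vocab_size, dim)   # representations of center words
ctx_emb = nn.Embedding(vocab_size, dim)    # representations of context words

def sgns_loss(center, context, num_negatives=5):
    # Neighboring (center, context) pairs are trained toward high similarity,
    # while randomly drawn words serve as dissimilar negative pairs.
    pos = (word_emb(center) * ctx_emb(context)).sum(-1)
    neg_ids = torch.randint(0, vocab_size, (center.size(0), num_negatives))
    neg = (word_emb(center).unsqueeze(1) * ctx_emb(neg_ids)).sum(-1)
    return -(F.logsigmoid(pos).mean() + F.logsigmoid(-neg).mean())

center = torch.tensor([3, 17, 42])     # toy center-word indices
context = torch.tensor([5, 9, 120])    # their observed neighbors in the sliding window
loss = sgns_loss(center, context)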

Other self-supervised techniques extend word embeddings by finding representations for larger text structures, such as sentences or paragraphs, in the input data.[4] Doc2vec extends the generative training approach in word2vec by adding an additional input to the word prediction task based on the paragraph containing the word, and is therefore intended to represent paragraph-level context.[12]

Graph

The goal of many graph representation learning techniques is to produce an embedded representation of each node based on the overall network topology.[13] node2vec extends the word2vec training technique to nodes in a graph by using co-occurrence in random walks through the graph as the measure of association.[14] Another approach is to maximize mutual information, a measure of similarity, between the representations of associated structures within the graph.[4] An example is Deep Graph Infomax, which uses contrastive self-supervision based on mutual information between the representation of a “patch” around each node, and a summary representation of the entire graph. Negative samples are obtained by pairing the graph representation with either representations from another graph in a multigraph training setting, or corrupted patch representations in single graph training.[15]
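As a sketch of how random-walk co-occurrence can be generated for node2vec-style training (unbiased walks are shown for simplicity; node2vec itself biases the transition probabilities, and the toy graph below is an illustrative assumption):

import random

# Toy undirected graph as an adjacency list.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def random_walk(start, length):
    # Uniform random walk; node2vec additionally biases these transitions
    # with its return and in-out parameters.
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# Walks play the role of "sentences": nodes that co-occur within a walk
# become positive pairs for a word2vec-style skip-gram model over node IDs.
walks = [random_walk(node, length=5) for node in graph for _ in range(10)]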

Image

The domain of image representation learning has employed many different self-supervised training techniques, including transformation,[16] inpainting,[17] patch discrimination[18] and clustering.[19]

Examples of generative approaches are Context Encoders, which trains an AlexNet CNN architecture to generate a removed image region given the masked image as input,[17] and iGPT, which applies the GPT-2 language model architecture to images by training on pixel prediction after reducing the image resolution.[20]
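A rough sketch of the masking and reconstruction objective behind a Context Encoders-style task (the single-layer network below is only a placeholder for the convolutional encoder-decoder used in practice, and the image sizes and mask location are illustrative):

import torch
import torch.nn as nn

images = torch.rand(4, 3, 64, 64)        # toy batch of RGB images
masked = images.clone()
masked[:, :, 16:48, 16:48] = 0.0         # remove a central region from the input

# Placeholder for the convolutional encoder-decoder used in practice.
inpainting_net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

reconstruction = inpainting_net(masked)
# The loss is computed only on the region the model has to generate.
loss = nn.functional.mse_loss(reconstruction[:, :, 16:48, 16:48],
                              images[:, :, 16:48, 16:48])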

Many other self-supervised methods use siamese networks, in which different views of the image are generated through various augmentations and then aligned to have similar representations. The challenge is avoiding collapsed solutions in which the model encodes all images to the same representation.[21] SimCLR is a contrastive approach which uses negative examples in order to generate image representations with a ResNet CNN.[18] Bootstrap Your Own Latent (BYOL) removes the need for negative samples by encoding one of the views with a slowly updated moving average of the model parameters as they are being modified during training.[22]
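A minimal sketch of such a slowly updated ("target") encoder, in PyTorch; the encoder architecture and momentum value are illustrative assumptions:

import copy
import torch
import torch.nn as nn

online_encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128))
target_encoder = copy.deepcopy(online_encoder)   # starts as a copy of the online network

@torch.no_grad()
def update_target(momentum=0.99):
    # The target weights track a slow exponential moving average of the online
    # weights, so the two views are encoded by slightly different networks.
    for p_online, p_target in zip(online_encoder.parameters(), target_encoder.parameters()):
        p_target.mul_(momentum).add_(p_online, alpha=1.0 - momentum)

# Called after each optimizer step on the online encoder during training.
update_target()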

Video

With analogous results in masked prediction[23] and clustering,[24] video representation learning approaches are often similar to image techniques but must make use of the temporal sequence of video frames as an additional learned structure.[4] Examples include VCP, which masks video clips and trains the model to choose the correct one given a set of clip options,[23] and Xu et al., who train a 3D-CNN to identify the original order given a shuffled set of video clips.[25]
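A sketch of a clip-order prediction setup of the kind used by Xu et al.; the clip features, their number, and the classifier here are placeholders rather than details from the paper:

import itertools
import random
import torch
import torch.nn as nn

perms = list(itertools.permutations(range(3)))   # all possible orderings of 3 clips
clips = torch.randn(3, 256)                      # per-clip features, e.g. from a 3D-CNN

label = random.randrange(len(perms))             # index of the permutation that is applied
shuffled = clips[list(perms[label])].flatten()   # concatenated features of the shuffled clips

order_classifier = nn.Linear(3 * 256, len(perms))   # predicts which permutation was applied
loss = nn.functional.cross_entropy(order_classifier(shuffled).unsqueeze(0),
                                   torch.tensor([label]))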

Audio

Self-supervised representation techniques have also been applied to many audio data formats, particularly for speech processing.[4] Wav2vec 2.0 discretizes the audio waveform into timesteps via temporal convolutions, and then trains a transformer on masked prediction of random timesteps using a contrastive loss.[26] This is similar to the BERT language model, except that, as in many self-supervised approaches to video, the model chooses among a set of options rather than over the entire word vocabulary.[23][26]
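A rough sketch of the "choose among a set of options" step described above: the prediction at a masked timestep is scored against the true latent and a handful of distractors drawn from other timesteps (all tensors below are placeholders for the outputs of the convolutional encoder and transformer):

import torch
import torch.nn.functional as F

latents = torch.randn(50, 128)     # per-timestep latents from the convolutional feature encoder
predicted = torch.randn(128)       # transformer output at one masked timestep

true_idx = 20                                  # the timestep that was masked
distractors = torch.randint(0, 50, (10,))      # negative options sampled from other timesteps
candidates = torch.cat([latents[true_idx:true_idx + 1], latents[distractors]])

scores = F.cosine_similarity(predicted.unsqueeze(0), candidates)   # similarity to each option
loss = F.cross_entropy(scores.unsqueeze(0) / 0.1,                  # temperature-scaled
                       torch.tensor([0]))                          # the true latent is option 0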

Multimodal

Self-supervised learning has more recently been used to develop joint representations of multiple data types.[4] Approaches usually rely on some natural or human-derived association between the modalities as an implicit label, for instance video clips of animals or objects with characteristic sounds,[27] or captions written to describe images.[28] CLIP produces a joint image-text representation space by training to align image and text encodings from a large dataset of image-caption pairs using a contrastive loss.[29] MERLOT Reserve trains a transformer-based encoder to jointly represent audio, subtitles, and video frames from a large dataset of videos through three joint pretraining tasks: contrastive masked prediction of audio segments and of text segments, each given the video frames and the surrounding audio and text context, along with contrastive alignment of video frames with their corresponding captions.[27]
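A minimal sketch of the symmetric image-text contrastive objective used by CLIP-style training; the encoders are omitted, and the batch size, embedding dimension, and temperature are illustrative:

import torch
import torch.nn.functional as F

image_emb = F.normalize(torch.randn(16, 64), dim=1)   # encoded images (placeholder encoder output)
text_emb = F.normalize(torch.randn(16, 64), dim=1)    # encoded captions; row i matches image i

logits = image_emb @ text_emb.T / 0.07                # pairwise similarities, temperature-scaled
targets = torch.arange(16)                            # matching image-caption pairs lie on the diagonal
# Symmetric loss: align each image with its caption and each caption with its image.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2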

Multimodal representation models are typically unable to assume direct correspondence of representations in the different modalities, since the precise alignment can often be noisy or ambiguous. For example, the text "dog" could be paired with many different pictures of dogs, and correspondingly a picture of a dog could be captioned with varying degrees of specificity. This limitation means that downstream tasks may require an additional generative mapping network between modalities to achieve optimal performance, such as in DALL-E 2 for text-to-image generation.[30]

References

  1. ^ a b c Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. Cambridge, Massachusetts: MIT Press. pp. 524–534. ISBN 0-262-03561-8. OCLC 955778308.
  2. ^ Russell, Stuart J.; Norvig, Peter (2010). Artificial Intelligence: A Modern Approach (Third ed.). Prentice Hall. ISBN 9780136042594.
  3. ^ Hinton, Geoffrey; Sejnowski, Terrence (1999). Unsupervised Learning: Foundations of Neural Computation. MIT Press. ISBN 978-0262581684.
  4. ^ a b c d e f g h i j k Ericsson, Linus; Gouk, Henry; Loy, Chen Change; Hospedales, Timothy M. (May 2022). "Self-Supervised Representation Learning: Introduction, advances, and challenges". IEEE Signal Processing Magazine. 39 (3): 42–62. doi:10.1109/MSP.2021.3134634. ISSN 1558-0792.
  5. ^ a b Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S; Dean, Jeff (2013). "Distributed Representations of Words and Phrases and their Compositionality". Advances in Neural Information Processing Systems. 26. Curran Associates, Inc.
  6. ^ Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. Cambridge, Massachusetts: MIT Press. pp. 499–516. ISBN 0-262-03561-8. OCLC 955778308.
  7. ^ a b c Liu, Xiao; Zhang, Fanjin; Hou, Zhenyu; Mian, Li; Wang, Zhaoyu; Zhang, Jing; Tang, Jie (2021). "Self-supervised Learning: Generative or Contrastive". IEEE Transactions on Knowledge and Data Engineering: 1–1. doi:10.1109/TKDE.2021.3090866. ISSN 1558-2191.
  8. ^ Le-Khac, Phuc H.; Healy, Graham; Smeaton, Alan F. (2020). "Contrastive Representation Learning: A Framework and Review". IEEE Access. 8: 193907–193934. doi:10.1109/ACCESS.2020.3031549. ISSN 2169-3536.
  9. ^ a b Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013-09-06). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781 [cs].
  10. ^ "Improving Language Understanding by Generative Pre-Training" (PDF). Retrieved October 1, 2022.
  11. ^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (June 2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics: 4171–4186. doi:10.18653/v1/N19-1423.
  12. ^ Le, Quoc; Mikolov, Tomas (2014-06-18). "Distributed Representations of Sentences and Documents". International Conference on Machine Learning. PMLR: 1188–1196.
  13. ^ Cai, HongYun; Zheng, Vincent W.; Chang, Kevin Chen-Chuan (September 2018). "A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications". IEEE Transactions on Knowledge and Data Engineering. 30 (9): 1616–1637. doi:10.1109/TKDE.2018.2807452. ISSN 1558-2191.
  14. ^ Grover, Aditya; Leskovec, Jure (2016-08-13). "node2vec: Scalable Feature Learning for Networks". Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16. New York, NY, USA: Association for Computing Machinery: 855–864. doi:10.1145/2939672.2939754. ISBN 978-1-4503-4232-2. PMC 5108654. PMID 27853626.
  15. ^ Veličković, Petar; Fedus, William; Hamilton, William L.; Liò, Pietro; Bengio, Yoshua; Hjelm, R. Devon (2019). "Deep Graph Infomax". International Conference on Learning Representations (ICLR 2019).
  16. ^ Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  17. ^ a b Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros; "Context Encoders: Feature Learning by Inpainting". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2536-2544
  18. ^ a b Chen, Ting; Kornblith, Simon; Norouzi, Mohammad; Hinton, Geoffrey (2020-11-21). "A Simple Framework for Contrastive Learning of Visual Representations". International Conference on Machine Learning. PMLR: 1597–1607.
  19. ^ Caron, Mathilde; Misra, Ishan; Mairal, Julien; Goyal, Priya; Bojanowski, Piotr; Joulin, Armand (2020). "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments". Advances in Neural Information Processing Systems. 33.
  20. ^ Chen, Mark; Radford, Alec; Child, Rewon; Wu, Jeffrey; Jun, Heewoo; Luan, David; Sutskever, Ilya (2020-11-21). "Generative Pretraining From Pixels". International Conference on Machine Learning. PMLR: 1691–1703.
  21. ^ Xinlei Chen, Kaiming He; "Exploring Simple Siamese Representation Learning" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 15750-15758
  22. ^ Grill, Jean-Bastien; Strub, Florian; Altché, Florent; Tallec, Corentin; Richemond, Pierre; Buchatskaya, Elena; Doersch, Carl; Avila Pires, Bernardo; Guo, Zhaohan; Gheshlaghi Azar, Mohammad; Piot, Bilal; Kavukcuoglu, Koray; Munos, Remi; Valko, Michal (2020). "Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning". Advances in Neural Information Processing Systems. 33.
  23. ^ a b c Luo, Dezhao; Liu, Chang; Zhou, Yu; Yang, Dongbao; Ma, Can; Ye, Qixiang; Wang, Weiping (2020-04-03). "Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning". Proceedings of the AAAI Conference on Artificial Intelligence. 34 (07): 11701–11708. doi:10.1609/aaai.v34i07.6840. ISSN 2374-3468.
  24. ^ Alwassel, Humam; Mahajan, Dhruv; Korbar, Bruno; Torresani, Lorenzo; Ghanem, Bernard; Tran, Du (2020). "Self-Supervised Learning by Cross-Modal Audio-Video Clustering". Advances in Neural Information Processing Systems. 33.
  25. ^ Xu, Dejing; Xiao, Jun; Zhao, Zhou; Shao, Jian; Xie, Di; Zhuang, Yueting (June 2019). "Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction". 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 10326–10335. doi:10.1109/CVPR.2019.01058.
  26. ^ a b Baevski, Alexei; Zhou, Yuhao; Mohamed, Abdelrahman; Auli, Michael (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations". Advances in Neural Information Processing Systems. 33.
  27. ^ a b Zellers, Rowan; Lu, Jiasen; Lu, Ximing; Yu, Youngjae; Zhao, Yanpeng; Salehi, Mohammadreza; Kusupati, Aditya; Hessel, Jack; Farhadi, Ali; Choi, Yejin (2022). "MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 16375–16387.
  28. ^ Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela; Clark, Jack; Krueger, Gretchen; Sutskever, Ilya (2021-07-01). "Learning Transferable Visual Models From Natural Language Supervision". International Conference on Machine Learning. PMLR: 8748–8763.
  29. ^ Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela; Clark, Jack; Krueger, Gretchen; Sutskever, Ilya (2021-07-01). "Learning Transferable Visual Models From Natural Language Supervision". International Conference on Machine Learning. PMLR: 8748–8763.
  30. ^ Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022-04-12). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125 [cs].