Deep Inverse Cooking
Description
Medical images are widely used in hospitals for the diagnosis and treatment of many diseases, such as skin cancer or diabetic retinopathy. Machine learning algorithms have recently been shown to outperform human doctors in a broad variety of diagnostic tasks. A diagnosis is often posed either as a semantic segmentation problem, where a model is trained to classify each pixel of an image, or as a multi-label classification task, where the output is a set of tags. However, both types of output are hard to interpret because they give no account of how the decision was reached. A diagnosis made by a medical doctor is different: when a family doctor refers a patient to a specialist, the doctor expects a medical report in which the specialist explains the diagnosis. Likewise, the output of a neural network would be more useful if it were accompanied by a medical report written in natural language.
Recently, there has been so much progress in the development of image-to-text models that the task of automatically generating medical reports can now be considered feasible. However, such models require a large amount of paired data, i.e. images paired with medical reports. To the best of the author's knowledge, no such paired dataset is publicly available. In order to experiment with image-to-text models, the domain was therefore switched from medicine to cooking, where such data is plentiful. A dataset consisting of 0.9M recipes and 1.3M images was acquired by crawling five different cooking platforms. Since the majority of the recipes originate from community cooking websites, an extensive data cleaning pipeline had to be implemented. This reduced the number of unique ingredients from 1M to 1.3k, at the cost of dropping some recipes.
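The abstract does not describe the individual cleaning steps, so the following is only a minimal sketch of how ingredient strings from community recipes might be normalized and how recipes with rare ingredients might be dropped. The unit list, the `normalize` and `clean` helpers and the `min_count` threshold are all assumptions made for illustration, not the pipeline used in the thesis.

```python
import re
from collections import Counter

# Hypothetical unit words; the rules of the actual pipeline are not given in the abstract.
UNITS = {"cup", "cups", "tbsp", "tsp", "g", "kg", "ml", "l", "oz", "lb", "pinch"}


def normalize(raw: str) -> str:
    """Lowercase an ingredient string and strip notes, quantities and unit words."""
    text = raw.lower()
    text = re.sub(r"\([^)]*\)", " ", text)  # drop parenthetical notes such as "(optional)"
    text = re.sub(r"[^a-z\s]", " ", text)   # drop digits, fractions and punctuation
    words = [w for w in text.split() if w not in UNITS]
    return " ".join(words)


def clean(recipes, min_count=2):
    """Normalize all ingredients, keep frequent ones, drop recipes with rare ones.

    `recipes` is a list of ingredient lists. Returns (kept_recipes, vocabulary).
    """
    normalized = [[normalize(item) for item in recipe] for recipe in recipes]
    normalized = [[item for item in recipe if item] for recipe in normalized]

    # Count in how many recipes each normalized ingredient occurs.
    counts = Counter(item for recipe in normalized for item in set(recipe))
    vocab = {item for item, count in counts.items() if count >= min_count}

    # Assumed policy: a recipe is dropped if any of its ingredients is out of vocabulary.
    kept = [r for r in normalized if r and all(item in vocab for item in r)]
    return kept, vocab


recipes = [
    ["2 cups Flour", "1 tsp salt"],
    ["200 g flour", "Salt (optional)"],
    ["3 dragonfruit", "1 cup flour"],
]
kept, vocab = clean(recipes)
print(kept)
print(sorted(vocab))
```

Dropping a whole recipe when one ingredient is rare is one way to trade dataset size for a smaller vocabulary, which matches the trade-off mentioned above; mapping rare ingredients to a placeholder token would be an alternative that keeps more recipes.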
Using this dataset, a multi-task neural network model was implemented, trained and evaluated. Given an image of a dish, it generates a list of ingredients (cf. medical features) as well as a title and cooking instructions (cf. medical report). A VGG-16 encoder extracts image features. Given these features, a transformer-based decoder generates the list of ingredients. Finally, an additional transformer decoder generates the recipe title and the cooking instructions by processing the image and ingredient features simultaneously. Evaluation on unseen test data showed that the model achieves an F1 score of 38.62% for ingredient prediction, a BLEU-1 score of 7.17% for title generation and a BLEU-4 score of 6.15% for instruction generation. A comparison of the inverse cooking architecture with medical image captioning systems from the literature shows several similarities. Therefore, it is expected that the proposed model can be adapted and extended to generate medical reports in the future.
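To make the reported metrics concrete, the sketch below computes a set-based F1 score for a predicted ingredient list and a sentence-level BLEU-n score for generated text. It assumes a single reference, uniform n-gram weights and no smoothing. The abstract does not state how the scores were aggregated over the test set, so the functions `set_f1` and `bleu` illustrate the metrics in general rather than the exact evaluation protocol of the thesis.

```python
import math
from collections import Counter


def set_f1(predicted, reference):
    """F1 between two ingredient lists, treated as unordered sets."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n):
    """Sentence-level BLEU-max_n: clipped n-gram precision with brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, a single zero precision gives zero BLEU
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes candidates that are shorter than the reference.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


f1 = set_f1(["flour", "salt", "sugar"], ["flour", "salt", "butter", "egg"])
title_score = bleu("chocolate cake".split(), "chocolate chip cookies".split(), max_n=1)
print(round(f1, 4), round(title_score, 4))
```

BLEU-1 only rewards matching words, which suits short titles, whereas BLEU-4 also requires matching phrases of up to four words and is therefore the stricter choice for the longer instruction text.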
Files
| Name | Size | Checksum |
|---|---|---|
| VM02_MarcBravin_DeepInverseCooking_publish.pdf | 9.4 MB | md5:37ec29416115185d15cffb8a9d1d8084 |