Evaluating a foundation model for animal body segmentation based on few-shot learning against a domain-specific deep neural network

Abstract

Computer vision (CV) has been proposed as a powerful technology for collecting individual measurements of livestock animals, such as body weight or body condition score. In all of these tasks, the first step of image processing is semantic segmentation (SS): using deep neural networks to locate the pixels that belong to the animal's body and remove the background, which may add noise and other information not needed to compute body biometrics. Despite the generalization abilities of CV models, SS models often need to be re-trained on the specific dataset at hand to maximize performance, and thus require labor-intensive annotation. With the rise of foundation models such as GPT-4, LLaMA, and DALL-E, we aim to explore whether a foundation model for SS called SegGPT performs well in a highly specific agricultural scenario, in comparison with a model trained on domain-specific data, i.e., a U-net model trained on our datasets. Our evaluation is carried out over 9 different datasets of top-down depth images of calves' and cows' bodies. The combined datasets amount to a total of 4328 images (average count = 541) from 485 animals (average count = 54) from 2 weeks to 7 years of age, collected with a mix of RealSense D435 and Kinect V2 sensors in several farm settings (entering/exiting the milking parlor, in a chute, or during weighing on a scale or weighing cart). We investigated the performance of these two models under three scenarios: training on a single dataset and validating within the same dataset (Same Dataset Internal Validation; SDIV), training on a single dataset and validating on an external dataset (Single Dataset External Validation; SDEV), and training on multiple datasets and validating on an external dataset (Multiple Datasets External Validation; MDEV). For the U-net trained in the SDIV approach, we used a 70/30 train/test split over 100 epochs.
SegGPT and U-net achieved intersection over union (IoU) of 0.84 and 0.95, 0.73 and 0.59, and 0.84 and 0.91 for SDIV, SDEV, and MDEV, respectively. U-net was unable to segment the animal body with performance similar to SegGPT's on a new dataset when trained on a single dataset, but it outperformed SegGPT when trained on multiple datasets. Although U-net performed slightly better in these scenarios, it is important to highlight that SegGPT received only one prompt (one depth image with its annotation) per dataset and still performed reasonably well on unseen (validation) datasets. In conclusion, foundation models for image processing tasks can be an alternative to training domain-specific deep neural networks, which demand labor, annotation, and large datasets to achieve satisfactory results. However, the highest performance in similar agricultural scenarios still requires domain-specific models.
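The metric reported above, intersection over union (Jaccard index), can be sketched for binary segmentation masks as follows. This is a minimal pure-Python illustration over flattened masks, not the paper's actual evaluation pipeline:

```python
def iou(pred, truth):
    """Intersection over union for two flat binary masks (lists of 0/1).

    IoU = |pred AND truth| / |pred OR truth|, ranging from 0 (no overlap)
    to 1 (identical masks).
    """
    inter = sum(p and t for p, t in zip(pred, truth))
    union = sum(p or t for p, t in zip(pred, truth))
    return inter / union if union else 1.0  # both masks empty: perfect agreement

# Toy example: 4-pixel masks sharing one foreground pixel
pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
print(iou(pred, truth))  # 1 overlapping pixel / 3 union pixels ≈ 0.333
```

In practice, a segmentation library (e.g., the Jaccard metrics in scikit-learn or torchmetrics) would be used on full 2D mask arrays rather than this hand-rolled version.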

Publication
In ASAS 2024
Enrico Casella
Assistant Professor of Data Science for Animal Systems

Multi-disciplinary computer scientist with a focus on Artificial Intelligence and Computer Vision applications for animal systems.