Referenced paper: YOLOv1 to YOLOv10: A comprehensive review of YOLO variants and their application in the agricultural domain

The YOLO model series are single-stage object detectors, which merge the two steps of a two-stage object detector into one. A two-stage detector first proposes candidate regions and then performs classification and localisation within the proposed regions. Single-stage detectors instead predict bounding boxes and probability scores directly from the input image.

YOLOv1

Data processing,
The input image is split into $s\times s$ cells by an $s\times s$ grid. For each object in the image, the cell that contains the center of the object has a ground-truth probability score of 1, while all other cells, even those that contain parts of the object, have a ground-truth probability score of 0. The cell that contains the center of the object is then given a bounding box, represented by its center coordinates, width and height. Each bounding box also has a confidence score, defined as the product of the probability that the object lies inside the cell and the IoU (intersection over union) between the bounding box and the ground-truth object region. For the ground-truth box this score is 1, because both factors equal 1. \[\mathrm{Confidence} = p(j)\times \mathrm{IoU}\]
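The cell assignment and the confidence formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function names `responsible_cell`, `iou` and `confidence` are hypothetical, and the corner format `(x1, y1, x2, y2)` for boxes is an assumption made to keep the IoU arithmetic simple.

```python
def responsible_cell(cx, cy, img_w, img_h, s):
    """Return (row, col) of the grid cell that contains the object centre (cx, cy)."""
    # Clamp so that a centre lying exactly on the right/bottom edge stays inside the grid.
    col = min(int(cx / img_w * s), s - 1)
    row = min(int(cy / img_h * s), s - 1)
    return row, col


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) (assumed format)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence(p_object, box, truth):
    """Confidence = p(object in cell) * IoU(box, ground truth)."""
    return p_object * iou(box, truth)
```

For example, with a 448 by 448 image and `s = 7`, `responsible_cell(300, 100, 448, 448, 7)` gives `(1, 4)`, and `confidence(1.0, (0, 0, 2, 2), (0, 0, 2, 2))` gives `1.0`, matching the ground-truth case described above.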
During training,
The model predicts $B$ bounding boxes for each grid cell, each with its own dimensions and confidence score, giving $s\times s\times B$ bounding boxes per input image. The loss combines three terms: the error in the bounding box location, the error in the confidence score, and the error in the predicted probability that the cell contains object $i$.
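As a rough illustration of the three loss terms, the sketch below computes a simplified sum-of-squared-errors loss for a single cell and a single predicted box. The function names are hypothetical. The square roots on width and height and the weights `lambda_coord = 5` and `lambda_noobj = 0.5` are taken from the original YOLOv1 paper, not from the notes above, and the box format `(x, y, w, h)` is an assumption.

```python
import math


def num_boxes(s, b):
    """Total number of predicted boxes for one image: s * s * B."""
    return s * s * b


def cell_loss(pred_box, pred_conf, pred_probs,
              true_box, true_conf, true_probs,
              has_object, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified loss for one cell with one predicted box, boxes as (x, y, w, h)."""
    if not has_object:
        # A cell without an object centre only penalises confidence, down-weighted.
        return lambda_noobj * pred_conf ** 2

    px, py, pw, ph = pred_box
    tx, ty, tw, th = true_box
    # Location term: centre offset plus square-rooted width/height differences.
    loc = ((px - tx) ** 2 + (py - ty) ** 2
           + (math.sqrt(pw) - math.sqrt(tw)) ** 2
           + (math.sqrt(ph) - math.sqrt(th)) ** 2)
    # Confidence term: predicted confidence against its target.
    conf = (pred_conf - true_conf) ** 2
    # Probability term: squared error over the per-class probabilities.
    cls = sum((p - t) ** 2 for p, t in zip(pred_probs, true_probs))
    return lambda_coord * loc + conf + cls
```

With $s = 7$ and $B = 2$, `num_boxes(7, 2)` gives 98 boxes per image. A full implementation would also choose, among the $B$ boxes of a cell, the one with the highest IoU against the ground truth as the box responsible for the object; that selection is omitted here.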
During inference,
Given an image as input, the model predicts $s\times s\times B$ bounding boxes, $B$ for each cell. Each cell outputs a multi-class probability vector giving, for each class, the probability that the cell contains the center of an object of that class. Each bounding box in the cell carries location information and a confidence score that reflects how well the box overlaps the object. Because of the large number of bounding boxes, NMS (non-maximum suppression) is then used to reduce the number of boxes.
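The NMS step can be sketched as a greedy procedure: keep the highest-scoring box, discard every remaining box that overlaps it too much, and repeat. The version below is a class-agnostic illustration with hypothetical function names; the corner format `(x1, y1, x2, y2)` and the default threshold of 0.5 are assumptions, and in practice NMS is usually run separately for each class.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) (assumed format)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the boxes to keep."""
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box above the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first (IoU = 81/119) and is dropped
```

Here `kept` is `[0, 2]`: the first two boxes cover nearly the same region, so only the higher-scoring one survives, while the third box is far away and is kept.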