People quickly recognise human actions carried out in everyday activities. There is evidence that Minimal Recognisable Configurations (MIRCs) contain a combination of spatial and temporal visual features critical for reliable recognition. For complex activities, observers may produce different descriptions that vary in their semantic similarity (e.g., washing dishes vs cleaning dishes), potentially complicating the investigation of MIRCs in action recognition. Therefore, we measured the semantic consistency for 128 short videos of complex actions from the Epic-Kitchens-100 dataset (Damen et al., 2022), selected based on poor classification performance by our state-of-the-art computer vision network MOFO (Ahmadian et al., 2023). In an online experiment, participants viewed each video and identified the performed action by typing a 2-3 word description (capturing action and object). Each video was classified by at least 30 participants (N=76 total). Semantic consistency of the responses was determined using a custom pipeline involving the sentence-BERT language model, which generated embedding vectors representing semantic properties of the responses. We then used adjusted pair-wise cosine similarities between response vectors to determine a ground-truth description for each video, namely the response with the greatest semantic neighbourhood density (e.g., pouring oil, closing shelf). The greater the semantic neighbourhood density of a ground-truth candidate, the more semantically consistent the responses for the associated video. We identified 87 videos whose semantic consistency confirmed their reliable recognisability, i.e., where the cosine similarity between the ground-truth candidate and at least 70% of responses exceeded a similarity threshold of 0.65. We will use a subsample of these videos to investigate the role of MIRCs in human action recognition, e.g., by gradually degrading the spatial and temporal information in videos and measuring the impact on action recognition. The derived semantic space and MIRCs will be used to revise MOFO into a more biologically consistent and better-performing model.
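For illustration, a minimal sketch of this kind of consistency check is shown below: free-text responses are embedded with sentence-BERT, pairwise cosine similarities are computed, and the response with the highest mean similarity to all other responses is taken as the ground-truth candidate. The 0.65 threshold and 70% criterion follow the abstract; the specific model name, the use of mean pairwise similarity as the density measure, and all function names are assumptions rather than the authors' exact pipeline.

```python
# Minimal sketch of a semantic-consistency check for one video's free-text responses.
# Assumes the sentence-transformers package; mean pairwise cosine similarity is an
# illustrative stand-in for the authors' adjusted neighbourhood-density measure.
import numpy as np
from sentence_transformers import SentenceTransformer

def consistency_for_video(responses, sim_threshold=0.65, agreement=0.70):
    """Return the densest response and whether the video counts as reliably recognised."""
    model = SentenceTransformer("all-MiniLM-L6-v2")           # hypothetical model choice
    emb = model.encode(responses, normalize_embeddings=True)  # unit-norm sentence embeddings
    sims = emb @ emb.T                                        # pairwise cosine similarities
    off_diag = sims - np.eye(len(responses))                  # ignore self-similarity (=1)
    density = off_diag.sum(axis=1) / (len(responses) - 1)     # mean similarity to other responses
    gt = int(np.argmax(density))                              # ground-truth candidate index
    others = np.delete(sims[gt], gt)                          # candidate vs. every other response
    return responses[gt], float(np.mean(others >= sim_threshold)) >= agreement

# Example: responses collected for one clip
candidate, reliable = consistency_for_video(["pouring oil", "pour oil", "adding oil", "cooking"])
```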
Electrochemical potentials are essential for cellular life. For instance, cells generate and harness electrochemical gradients to drive a myriad of fundamental processes, from nutrient uptake and ATP synthesis to neuronal transduction. To generate and maintain these gradients, all cellular membranes carefully regulate ionic fluxes using a broad array of transport proteins. Because so many transport pathways operate in parallel, it is extremely difficult to untangle specific ion transport pathways and link them to membrane potential variations in live-cell studies. Conversely, synthetic membrane models, such as black lipid membranes and liposomes, are free of the structural complexity of cells and thus make it possible to isolate particular ion transport mechanisms and study them under tightly controlled conditions. Still, there is a lack of quantitative methods for correlating ionic fluxes with electrochemical gradient buildup in membrane models. Consequently, the use of these models as a tool for unravelling the coupling between ion transport and electrochemical gradients is limited. We developed a fluorescence-based approach for resolving the dynamic variation of membrane potential in response to ionic flux across giant unilamellar vesicles (GUVs). To gain maximal control over the size and membrane composition of these micron-sized liposomes, we developed an integrated microfluidic platform that is capable of high-throughput production and purification of monodisperse GUVs. By combining our microfluidic platform with quantitative fluorescence analysis, we determined the permeation rate of two biologically important electrolytes – protons (H+) and potassium ions (K+) – and were able to correlate their flux with electrochemical gradient accumulation across the lipid bilayer of single GUVs. By applying the same analysis principles, we also determined the permeation rate of K+ across two archetypal ion channels, gramicidin A and outer membrane porin F (OmpF). We then showed that the translocation rate of H+ across gramicidin A is four orders of magnitude higher than that of K+, whereas OmpF transported both ions at similar rates.
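For context only, the sketch below computes the Nernst (equilibrium) potential that a given transmembrane concentration gradient of a monovalent cation such as K+ or H+ would generate across a bilayer; this textbook relation illustrates the coupling between ionic gradients and membrane potential, and is not the fluorescence-analysis method used in the study.

```python
# Illustrative only: Nernst potential for an ion across a lipid bilayer.
import math

R = 8.314      # J mol^-1 K^-1, gas constant
F = 96485.0    # C mol^-1, Faraday constant

def nernst_potential_mV(c_out, c_in, z=1, temp_K=298.15):
    """Equilibrium potential (mV) for an ion of valence z, given outside/inside concentrations."""
    return 1e3 * (R * temp_K) / (z * F) * math.log(c_out / c_in)

# Example: a tenfold K+ gradient (e.g., 100 mM outside, 10 mM inside) gives roughly +59 mV.
print(round(nernst_potential_mV(100e-3, 10e-3), 1))
```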
This research represents a groundbreaking approach in plant phenotyping by harnessing 3D point clouds generated from video data. Focusing on the comprehensive characterization of plant traits, this method enhances the precision and depth of phenotypic analysis, crucial for advancements in genetics, breeding, and agricultural practices.
Advanced Video Data Capture and Processing for Detailed Segmentation
High-Fidelity Video Acquisition: Capturing detailed video footage of plants under varying environmental conditions forms the foundation of this method. The use of high-resolution cameras allows for capturing minute details crucial for accurate part segmentation.
Rigorous Preprocessing for Optimal Data Quality: Following capture, the video data undergoes meticulous preprocessing. Stabilization, noise filtering, and color correction are performed to ensure that the subsequent segmentation algorithms can accurately identify different parts of the plant.
Segmentation and 3D Point Cloud Generation: State-of-the-art image processing algorithms segment the plant parts within each video frame. Subsequently, photogrammetry and depth estimation techniques create detailed 3D point clouds, effectively capturing the geometry of individual plant components.
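As a rough illustration of how a segmented frame can contribute to a 3D point cloud, the sketch below back-projects the pixels of one segmented plant part into camera-frame 3D coordinates using an estimated depth map and pinhole camera intrinsics; the intrinsics, depth source, and mask format are illustrative assumptions rather than the exact pipeline described here.

```python
# Sketch: lift a segmented plant part from one frame into a 3D point cloud.
# Assumes a per-frame depth map (e.g., from stereo or monocular depth estimation),
# a binary mask for the part, and known pinhole camera intrinsics.
import numpy as np

def mask_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Return an (N, 3) array of camera-frame 3D points for masked pixels."""
    v, u = np.nonzero(mask)              # pixel rows/cols belonging to the plant part
    z = depth[v, u]                      # metric depth at those pixels
    x = (u - cx) * z / fx                # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Usage (illustrative values only):
# cloud = mask_to_point_cloud(depth_map, leaf_mask, fx=1200.0, fy=1200.0, cx=960.0, cy=540.0)
# Per-frame clouds can then be registered into a common coordinate frame across frames.
```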
Part Segmentation and Trait Measurement for Enhanced Phenotyping
Precise Plant Part Segmentation: This methodology enables the accurate segmentation of individual plant parts, such as leaves, stems, and flowers, within the 3D space. This precise segmentation is crucial for assessing complex plant traits and understanding plant structure in its entirety.
Comprehensive Trait Measurement: The 3D point clouds facilitate comprehensive measurements of plant traits. This includes quantifying leaf area, stem thickness, flower size, and even more subtle features like leaf venation patterns, providing a multi-dimensional view of plant phenotypic traits.
Temporal Tracking for Dynamic Trait Analysis: An integral advantage of using video data is the ability to track and measure these traits over time. This dynamic analysis allows for monitoring growth patterns, developmental changes, and responses to environmental stimuli in a way that static images cannot achieve.
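To make the trait-measurement and temporal-tracking steps concrete, the sketch below estimates one simple trait (leaf area approximated by the convex hull of points projected onto the leaf plane) from a segmented point cloud and evaluates it per capture time; the trait choice, data layout, and helper names are illustrative assumptions.

```python
# Sketch: measure projected leaf area from a segmented 3D point cloud and track it over time.
import numpy as np
from scipy.spatial import ConvexHull

def projected_leaf_area(points_xyz):
    """Approximate leaf area as the area of the convex hull of points projected onto the leaf plane."""
    pts = points_xyz - points_xyz.mean(axis=0)
    _, _, vt = np.linalg.svd(pts, full_matrices=False)   # principal axes of the leaf
    plane_coords = pts @ vt[:2].T                        # project onto the two dominant axes
    return ConvexHull(plane_coords).volume               # for 2D hulls, .volume is the enclosed area

# Temporal tracking: one segmented cloud per capture, keyed by timestamp (assumed layout).
# areas = {t: projected_leaf_area(cloud) for t, cloud in leaf_clouds_by_time.items()}
# Growth rates can then be estimated from the resulting time series.
```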
Conclusion: A Breakthrough in Plant Phenotyping and Agricultural Research
This research significantly enhances the capability for detailed plant part segmentation and trait measurement, setting a new standard in plant phenotyping. The level of detail and accuracy afforded by this method offers invaluable insights for agricultural technology, plant genetics, and breeding programs. It represents a critical step forward in our ability to understand and optimize plant characteristics, with far-reaching implications for food production and ecological sustainability.
Our rich, embodied visual experiences of the world involve integrating information from multiple sensory modalities – yet how the brain brings together multiple sensory reference frames to generate such experiences remains unclear. Recently, it has been demonstrated that BOLD fluctuations throughout the brain can be explained as a function of the activation pattern on the primary visual cortex (V1) topographic map. This class of ‘connective field’ models allows us to project V1’s map of visual space into the rest of the brain and discover previously unknown visual organization. Here, we extend this powerful principle to incorporate both visual and somatosensory topographies by explaining BOLD responses during naturalistic movie-watching as a function of two spatial patterns (connective fields) on the surfaces of V1 and S1. We show that responses in the higher levels of the visual hierarchy are characterized by multimodal topographic connectivity: these responses can be explained as a function of spatially specific activation patterns on both the retinotopic and somatosensory homunculus topographies, indicating that somatosensory cortex participates in naturalistic vision. These novel multimodal tuning profiles are in line with known visual category selectivity, for example for faces and manipulable objects. Our findings demonstrate a scale and granularity of multisensory tuning far more extensive than previously assumed. When inspecting topographic tuning in S1, we find that a full band of extrastriate visual cortex, from retrosplenial cortex laterally to the fusiform gyrus, is tiled with somatosensory homunculi. These results demonstrate the intimate integration of information about visual coordinates and body parts in the brain that likely supports visually guided movements and our rich, embodied experience of the world. Finally, we present initial data from a new, densely sampled 7T fMRI movie-watching dataset optimised to shed light on the brain basis of human action understanding.
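To illustrate the modelling idea, the sketch below predicts a single voxel's BOLD time course as the weighted sum of two Gaussian connective fields, one on the V1 surface and one on the S1 surface; the data structures, parameterisation, and fitting strategy are assumptions for illustration, not the exact model used in the study.

```python
# Sketch: a two-source connective field model predicting one voxel's BOLD time course
# as Gaussian-weighted sums of activity on the V1 and S1 cortical surfaces.
import numpy as np

def cf_prediction(src_ts, dists, center_idx, sigma):
    """Gaussian connective field on one source surface.

    src_ts : (n_vertices, n_timepoints) source-region time courses
    dists  : (n_vertices, n_vertices) geodesic distances on the source surface
    """
    w = np.exp(-dists[center_idx] ** 2 / (2.0 * sigma ** 2))
    return (w / w.sum()) @ src_ts

def two_source_prediction(v1_ts, v1_d, s1_ts, s1_d, params):
    """Weighted combination of a V1 and an S1 connective field."""
    c_v1, s_v1, c_s1, s_s1, b_v1, b_s1 = params
    return b_v1 * cf_prediction(v1_ts, v1_d, c_v1, s_v1) + \
           b_s1 * cf_prediction(s1_ts, s1_d, c_s1, s_s1)

# Per voxel, the centres, sizes and betas would be fit (e.g., grid search over centres/sigmas
# with betas from least squares) and evaluated by cross-validated variance explained.
```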
We do not notice everything in front of us, due to our limited attention capacity. What we attend to forms our conscious experience and is what we retain over time. Thus, creative content creators must strive to direct viewers’ attention in different media, from cinema to computer games. To do this they have developed various techniques that either directly use centrally presented cues, such as arrows or instructions, to move attention, or rely on so-called “bottom-up” cues that manipulate the salience of parts of an image. Shifting attention usually involves moving our central vision around a screen, and the problem of directing attention becomes more pronounced in virtual environments where users are free to explore by moving in any direction. This can be seen in first-person, screen-based computer video games. Such an experience allows the user to choose how they sample their environment. Often the designer of the environment wishes the user to interact with and view certain parts of the scene. In this study we test a subtle manipulation of visual attention through varying depth of field. Varying depth of field is a cinematic technique that can be implemented in virtual worlds and involves keeping parts of the scene in focus whilst blurring other parts. We use eye tracking to investigate this technique in a 3D game environment, rendered on a monitor screen. Participants navigated through the environment using keyboard keys; in the first part they freely explored, and in the second part they were instructed to find a target object. We manipulated whether the frames were rendered fully in focus (termed a deep depth of field) or whether a shallow depth of field was applied (where the outer edges of the scene appear blurred). We measured where on the screen participants looked. We divided the screen into 3×3 equal-sized regions and calculated the proportion of time participants spent looking in the central square. On average across all trials, participants spent 67% of their fixation time on the central area of the screen, suggesting that they preferred to navigate by looking in the direction they were heading. We found a significant difference when participants were freely exploring the scene – they spent more time looking in the centre of the screen when a shallow depth of field was applied than with a deep depth of field. This was no longer the case during the search task. We demonstrate how these techniques might be effective for manipulating attention by keeping users’ eyes looking straight ahead when they are freely exploring a virtual environment.
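As an example of the screen-region analysis, the sketch below computes the proportion of fixation time falling in the central cell of a 3×3 grid over the display; the fixation data format and screen resolution are assumptions.

```python
# Sketch: proportion of fixation time falling in the central cell of a 3x3 screen grid.
def central_fixation_proportion(fixations, screen_w=1920, screen_h=1080):
    """fixations: iterable of (x_px, y_px, duration_ms) tuples; returns a proportion in [0, 1]."""
    cell_w, cell_h = screen_w / 3.0, screen_h / 3.0
    total = central = 0.0
    for x, y, dur in fixations:
        total += dur
        if cell_w <= x < 2 * cell_w and cell_h <= y < 2 * cell_h:
            central += dur
    return central / total if total else 0.0
```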
People are increasingly consuming video media through the internet. This can lead to a mismatch between the auditory and visual streams due to internet connectivity. For instance, in a video of a news anchor reporting a story, there can be a time lag between the spoken words and the corresponding movements of the anchor’s mouth and lips (as well as body gestures). This asynchrony between the auditory and visual streams can also arise from various physical, bio-physical and neural mechanisms, but people are often not aware of these differences. There is accumulating evidence that people adapt to auditory-visual asynchrony at different time scales and for different stimulus categories. However, previous studies often used very simple auditory-visual stimuli (e.g., flashing lights paired with brief tones) or short videos of a few seconds. In the current study, we investigated the temporal adaptation of continuous speech presented in longer videos. Speech is one of the strongest cases for auditory-visual integration, as demonstrated by multi-sensory illusions like the McGurk-MacDonald and ventriloquist effects. To measure temporal adaptation for speech videos, we developed a continuous-judgment paradigm in which participants continuously judge over several tens of seconds whether an auditory-visual speech stimulus is synchronous or not. The stimuli consisted of 40 videos (duration: M = 63.3s, SD = 10.3s). For each video, we filmed a close-up (upper body) of one male and one female speaker reporting a news story transcribed from a real news clip (e.g., about the Brexit vote outcome or about Boris Johnson’s resignation). Each speaker reported 20 news stories. We then created seven versions of each video by shifting the relative stimulus onset asynchrony (SOA) between the auditory and visual streams from -240ms (auditory stream leading) to +240ms (visual stream leading) in 80ms steps. This included SOA = 0ms (i.e., the original synchronous video). The first 5-10s of all videos were synchronous. For each participant in the continuous-judgment task, we randomly selected 10 videos at each SOA (70 total). Participants continuously judged the synchrony of each video by pressing/releasing the spacebar throughout the duration of the video (response sampling rate = 33ms). The mean proportion of perceived synchrony across the duration of each video was calculated from participants’ continuous responses after the initial synchronous period. For the visual-leading videos (SOAs > 0ms), participants initially showed a drop in the proportion of perceived synchrony, but this proportion increased over time, suggesting that they were adapting to the asynchrony. The magnitude of temporal adaptation depended on the SOA, with the largest SOA producing the largest adaptation. Consistent with previous studies, our findings suggest that temporal adaptation occurs for long, continuous speech videos but only when the visual stream leads the auditory stream.
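For illustration, the sketch below converts one participant's continuously sampled spacebar state for one video into a mean proportion of perceived synchrony, excluding the initial synchronous period; the data layout and the exact length of the excluded lead-in (which varied between 5 and 10 s per video) are assumptions.

```python
# Sketch: mean proportion of "synchronous" judgments over one video's duration,
# from a continuously sampled spacebar state (True = pressed = judged synchronous).
# The 33 ms sampling step follows the abstract; the per-video synchronous lead-in
# duration is assumed to be known.
import numpy as np

def proportion_perceived_synchrony(pressed, sample_ms=33, skip_initial_s=10.0):
    """pressed: 1-D boolean array, one sample per 33 ms of the video."""
    start = int(round(skip_initial_s * 1000.0 / sample_ms))   # drop the synchronous lead-in
    window = np.asarray(pressed[start:], dtype=float)
    return window.mean() if window.size else np.nan

# Averaging this value across the 10 videos shown at each SOA gives one point
# of the adaptation curve per participant and SOA.
```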
In the evolving landscape of traffic management and autonomous driving technology, the analysis of traffic scenes from video data stands as a crucial challenge. Traditional approaches often rely on complex, high-dimensional image analysis, necessitating significant computational resources and sophisticated algorithms. Recognizing the limitations of these methods, our research introduces a novel, streamlined approach centered around a graph-based framework for understanding traffic dynamics.
Central to our methodology is the exploration of complex scene analysis through the lens of object-object interaction within traffic scenes. These interaction dynamics are captured through our specially designed graph structures, which are further analyzed and interpreted using Graph Neural Networks (GNNs) as a foundational element. By employing GNNs, our framework delves into the intricate dynamics of traffic environments. We focus on the high-level interactions and behaviours within traffic scenes, distilling the essential patterns of movement and relationships among elements such as vehicles and pedestrians.
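To give a concrete sense of the graph representation, the sketch below builds a small scene graph with agents as nodes and proximity-based edges, then applies one round of neighbour-averaging message passing followed by a pooled readout; the feature choice, distance threshold, and tiny layer are illustrative assumptions, not the architecture evaluated in this work.

```python
# Sketch: represent a traffic scene as a graph (agents as nodes, proximity as edges)
# and run one round of mean-aggregation message passing with a pooled readout.
import numpy as np

def build_scene_graph(positions, radius=20.0):
    """Adjacency matrix linking agents closer than `radius` metres."""
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    adj = (d < radius).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

def gnn_layer(node_feats, adj, weight):
    """Mean-neighbour aggregation, concatenation with self features, linear map, ReLU."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1.0)
    messages = adj @ node_feats / deg
    return np.maximum(0.0, np.concatenate([node_feats, messages], axis=1) @ weight)

# Usage (illustrative): node features = [x, y, speed, heading] per agent.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))
adj = build_scene_graph(feats[:, :2] * 10.0)
hidden = gnn_layer(feats, adj, rng.normal(size=(8, 16)))
scene_embedding = hidden.mean(axis=0)   # pooled representation for scene-level classification
```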
To validate the effectiveness of our framework, we conducted extensive testing using two prominent datasets: the METEOR Dataset and the INTERACTION Dataset. Our methodology achieved an accuracy of 62.03% on the METEOR Dataset and 98.50% on the INTERACTION Dataset. These results underscore the capability of our graph-based approach to accurately interpret and analyze the dynamics of traffic scenes.
Through this rigorous evaluation, our research not only showcases the significant advantages of incorporating graph neural networks for traffic scene analysis but also highlights the power of our novel approach in abstracting and understanding the complex patterns of movement and interactions within traffic environments. Our work sets a new benchmark in the field, offering a promising direction for future advancements in traffic management and autonomous vehicle technologies.
In the field of person re-identification (re-ID), accurately matching individuals across different camera views poses significant challenges due to variations in pose, illumination, viewpoint, and notably, scale. Traditional methods in re-ID have focused on robust feature descriptor generation and sophisticated metric learning, yet they often fall short in addressing scale variations effectively. In this work, we introduce a novel approach to scale-invariant person re-ID through the development of our scale-invariant residual networks coupled with an innovative batch adaptive triplet loss function for enhanced deep metric learning. The first network, termed Scale-Invariant Triplet Network (SI-TriNet), leverages pre-trained weights to form a deeper architecture, while the second, Scale-Invariant Siamese Resnet-32 (SISR-32), is a shallower structure trained from scratch. These networks are adept at handling scale variations, a common yet challenging aspect in re-ID tasks, by employing scale-invariant (SI) convolution techniques that ensure robust feature detection across multiple scales. This is complemented by our proposed batch adaptive triplet loss function that refines the metric learning process, dynamically prioritizing learning from harder positive samples to improve the model’s discriminatory capacity. Extensive evaluation on benchmark datasets Market-1501 and CUHK03 demonstrates the superiority of our proposed methods over existing state-of-the-art approaches. Notably, SI-TriNet and SISR-32 show significant improvements in both mean Average Precision (mAP) and rank-1 accuracy metrics, affirming the efficacy of our scale-invariant architectures and the novel loss function in addressing the complexities of person re-ID. This study not only advances the understanding of scale-invariant feature learning in deep networks but also sets a new benchmark in the person re-ID domain, promising more accurate and scalable solutions for real-world surveillance and security applications.
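To illustrate the kind of hard-positive weighting described above, the sketch below implements a triplet-style loss in which harder positive pairs within a batch receive larger weights via a softmax over positive distances; this weighting scheme, the margin value, and the hardest-negative mining are illustrative assumptions rather than the exact batch adaptive triplet loss proposed here.

```python
# Sketch: a triplet loss that up-weights harder positive pairs within each batch.
import torch
import torch.nn.functional as F

def batch_adaptive_triplet_loss(embeddings, labels, margin=0.3):
    """embeddings: (B, D) L2-normalised features; labels: (B,) identity ids."""
    dist = torch.cdist(embeddings, embeddings, p=2)               # pairwise distances
    same = labels[:, None].eq(labels[None, :])
    pos_mask = same & ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    neg_mask = ~same

    losses = []
    for i in range(len(labels)):
        if not pos_mask[i].any() or not neg_mask[i].any():
            continue
        pos_d = dist[i][pos_mask[i]]
        neg_d = dist[i][neg_mask[i]].min()                        # hardest negative for this anchor
        w = F.softmax(pos_d, dim=0)                               # harder positives get larger weights
        losses.append((w * F.relu(pos_d - neg_d + margin)).sum())
    return torch.stack(losses).mean() if losses else embeddings.new_zeros(())
```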
As humans move around, performing their daily tasks, they are able to recall where they have positioned objects in their environment, even if these objects are currently out of sight. In this paper, we aim to mimic this spatial cognition ability. We thus formulate the task of Out of Sight, Not Out of Mind – 3D tracking of active objects using observations captured through an egocentric camera. We introduce Lift, Match and Keep (LMK), a method which lifts partial 2D observations to 3D world coordinates, matches them over time using visual appearance, 3D location and interactions to form object tracks, and keeps these object tracks even when they go out of view of the camera – hence keeping in mind what is out of sight.
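As a sketch of what the lift step might look like, the snippet below back-projects the centre of a 2D detection into world coordinates using an estimated metric depth, pinhole intrinsics, and the camera pose for that frame; these inputs and the function name are illustrative assumptions, not the LMK implementation.

```python
# Sketch of a "lift" step: back-project a 2D object detection into world coordinates
# using a depth estimate and the camera pose for that frame.
import numpy as np

def lift_to_world(bbox_center_uv, depth_m, K, cam_to_world):
    """bbox_center_uv: (u, v) pixel; K: 3x3 intrinsics; cam_to_world: 4x4 pose matrix."""
    u, v = bbox_center_uv
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])        # camera-frame ray through the pixel
    p_cam = ray * depth_m                                  # scale by estimated metric depth
    p_world = cam_to_world @ np.append(p_cam, 1.0)         # transform into world coordinates
    return p_world[:3]

# Matching then associates such 3D points over time (appearance + proximity),
# and keeping retains the last known world position when the object leaves the view.
```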
We test LMK on 100 long videos from EPIC-KITCHENS. Our results demonstrate that spatial cognition is critical for correctly locating objects over short and long time scales. For example, for one long egocentric video, we estimate the 3D locations of 50 active objects. Of these, 60% can be correctly positioned in 3D 2 minutes after leaving the camera view.
The interpretation of social interactions between people is important in many daily situations. In this talk, we will present the results of 2 studies examining the visual perception of other people interacting. The first study used functional brain imaging to investigate the brain regions involved in the incidental visual processing of social interactions; that is, the processing of the body movements outside the observers’ focus of attention. The second study used a visual search paradigm to test whether people are better able to find interacting than non-interacting people in a crowd.
In the first study, we measured brain activation while participants (N = 33) were presented with point-light dyads portraying communicative interactions or individual actions. These types of stimuli allowed us to investigate the role of motion in processing social interactions by removing form cues. Participants discriminated the brightness of two crosses also presented on the screen, thus keeping the body movements outside their task-related focus of attention. To investigate brain regions that may process the spatial and temporal relationships between the point-light displays, we either reversed the facing direction of one agent or spatially scrambled the local motion of the points. Incidental processing of communicative interactions elicited activation in the right anterior STS only when the two agents were facing each other. Controlling for differences in local motion by subtracting brain activation to scrambled versions of the point-light displays revealed significant activation in parietal cortex for communicative interactions, as well as in the left amygdala and brain stem/cerebellum. Our results complement previous studies and suggest that additional brain regions may be recruited to incidentally process the spatial and temporal contingencies that distinguish people acting together from people acting individually.
Our second study focussed on deliberate visual processing of communicative interactions in the observer’s focus of attention. Participants viewed arrays of the same point-light dyads used in our first study, but here they searched for an interacting dyad amongst a set of independently acting dyads, or for an independently acting dyad amongst a set of interacting dyads, by judging whether a target dyad was present or absent (targets were present on half the trials). In each of two experiments (N=32 and N=49), participants were faster and more accurate to detect the presence of interacting than independently acting target dyads. Moreover, visual search for interacting target dyads was more efficient than for independently acting target dyads, as indicated by shallower search slopes (the increase in response time with increasing number of distractors) for the former than for the latter. In the second experiment, we measured participants’ eye movements using an eye tracker. The analyses of the eye tracking data are ongoing. Based on the results from our first study and on search performance, we expect that fixation durations on communicative-dyad targets will be shorter than on independent-dyad targets, because less attentional focus (as measured by fixation duration) is needed to process social interactions.
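To illustrate the search-slope measure mentioned above, the sketch below fits a simple linear regression of correct-trial response times against display set size; the data format is an assumption, and a shallower slope indicates more efficient search.

```python
# Sketch: estimating a search slope (ms per additional distractor) per condition
# from correct-trial response times with a least-squares linear fit.
import numpy as np

def search_slope(set_sizes, rts_ms):
    """Return (slope, intercept) of RT in ms regressed on display set size."""
    slope, intercept = np.polyfit(np.asarray(set_sizes, float), np.asarray(rts_ms, float), 1)
    return slope, intercept

# A shallower slope for interacting-dyad targets than for independently acting
# targets indicates more efficient search for social interactions.
```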