Audio-to-talking-face generation stands at the forefront of advances in generative AI. By producing synchronized, realistic talking faces, it bridges the gap between audio and visual representations, significantly improving human-computer interaction and content accessibility for diverse audiences. Despite substantial research in this area, critical challenges such as unrealistic facial animations, inaccurate audio-lip synchronization, and intensive computational demands continue to limit the practical application of talking face generation methods. To address these issues, we introduce a novel approach that leverages the emerging capabilities of Stable diffusion models and vision Transformers for Talking face generation (StableTalk). By incorporating a Re-attention mechanism and an adversarial loss into StableTalk, we markedly improve audio-lip alignment and the consistency of facial animations across frames. Moreover, we improve computational efficiency by refining operations within the latent space and dynamically adjusting the visual focus according to the given conditions. Experimental results demonstrate that StableTalk surpasses existing methods in image quality, audio-lip synchronization, and computational efficiency.
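
To make the re-attention idea mentioned above concrete, the following is a minimal, illustrative PyTorch sketch of a re-attention block in the style used for deep vision Transformers (a learned mixing of per-head attention maps followed by renormalization). It is an assumption-laden sketch, not the paper's implementation; the class name, hyperparameters, and normalization choice are placeholders.

```python
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    """Illustrative re-attention block: standard multi-head self-attention whose
    per-head attention maps are mixed by a learnable head-to-head matrix and
    renormalized before being applied to the values."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable head-mixing matrix (the "re-attention" step).
        self.reattn_weight = nn.Parameter(torch.randn(num_heads, num_heads))
        self.reattn_norm = nn.BatchNorm2d(num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)

        # Mix attention maps across heads, then renormalize.
        attn = torch.einsum('hg,bgij->bhij', self.reattn_weight, attn)
        attn = self.reattn_norm(attn)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    # Toy usage: a batch of 2 sequences of 16 latent tokens with dimension 256.
    tokens = torch.randn(2, 16, 256)
    print(ReAttention(dim=256)(tokens).shape)  # torch.Size([2, 16, 256])
```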