Windsurf Launches Its SWE Model

PLUS: OpenAI's Agent Building Guide + Google AlphaEvolve

Welcome back to Daily AI Skills.

Here’s what we are covering today:
1. Google Debuts AlphaEvolve
2. Windsurf SWE-1
3. LLMs Struggle with Multi-Turn Conversations

+ OpenAI Agent Building Guide

Google Debuts AlphaEvolve

Google has introduced AlphaEvolve, a new coding agent that combines Gemini models with evolutionary algorithms to tackle complex scientific and computational problems. It has already improved Google's internal efficiency and made progress on long-standing maths puzzles.

Here’s how it works:

  • AlphaEvolve leverages different Gemini models—using Flash for generating ideas and Pro for in-depth analysis—to write code, which is then tested and refined through iterative evolution.

  • It has already made breakthroughs in mathematics, including the first advancement on Strassen’s 1969 matrix multiplication algorithm.

  • The system is also being used internally at Google to optimise data centre scheduling, accelerate AI training (even for its own models), and assist with chip design.

  • In tests on over 50 unsolved math problems, AlphaEvolve matched state-of-the-art solutions in 75% of cases and found entirely new, superior ones in another 20%.
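The loop described above (propose candidates, evaluate them automatically, keep improvements, repeat) can be sketched in miniature. This is a toy illustration, not Google's implementation: the `propose_variants` function stands in for the Gemini Flash/Pro calls, and the evaluator is a simple numeric objective rather than a real program-scoring harness.

```python
import random

random.seed(0)  # for reproducibility of this toy run

def evaluate(candidate):
    # Automated evaluator: lower is better. A toy objective standing in
    # for "how well does the generated program perform".
    return sum((x - 3.0) ** 2 for x in candidate)

def propose_variants(parent, n=8):
    # Stand-in for LLM-proposed edits (AlphaEvolve uses Flash for breadth
    # and Pro for depth); here, just random perturbations of the parent.
    return [[x + random.gauss(0, 0.5) for x in parent] for _ in range(n)]

def evolve(seed_candidate, generations=50):
    best = seed_candidate
    best_score = evaluate(best)
    for _ in range(generations):
        for child in propose_variants(best):
            score = evaluate(child)
            if score < best_score:  # keep only strict improvements
                best, best_score = child, score
    return best, best_score

best, score = evolve([0.0, 0.0, 0.0])
```

The key design point is that the LLM only has to propose plausible edits; the automated evaluator, not the model, decides what survives — which is why the approach works on problems where solutions are hard to find but easy to check.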

Windsurf Unveils SWE-1: AI for Full-Stack Software Engineering

AI coding platform Windsurf has unveiled SWE-1, its first proprietary family of AI models tailored to support the entire software engineering lifecycle, not just code generation.

Here’s what’s new:

  • The SWE-1 lineup includes three models: SWE-1 (a full-size version for paid users), SWE-1-lite (now replacing Cascade Base for all users), and SWE-1-mini.

  • Internal benchmarks show SWE-1 outperforms all non-frontier and open-source models, ranking just below top-tier systems like Claude 3.7 Sonnet.

  • Unlike traditional code-centric models, SWE-1 is designed to operate across multiple interfaces—editors, terminals, and browsers.

  • A key innovation is its “flow awareness” system, which maintains a shared timeline between the user and AI, enabling smooth collaboration and task handoffs throughout development.

Check out the full announcement here: https://windsurf.com/blog/windsurf-wave-9-swe-1

LLMs Struggle with Multi-Turn Conversations

A new study by researchers from Microsoft and Salesforce reveals that large language models (LLMs) struggle in multi-turn conversations where instructions are revealed gradually, often losing track and failing to recover effectively.

Key findings:

  • The team evaluated 15 top LLMs—including GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Pro—across six generative tasks.

  • While models performed well in single-turn interactions (around 90% success), their accuracy dropped to about 60% in longer, multi-step conversations.

  • LLMs frequently “got lost” by making premature assumptions, acting before collecting enough information, or compounding early mistakes.

  • Adjusting temperature settings or using different reasoning methods didn’t significantly improve performance—volatility remained high, even among the best models.
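The study's setup "shards" one full instruction across several turns. The toy harness below contrasts a single-turn run (all constraints up front) with a multi-turn run where the model commits to an answer after the first shard — the "premature assumption" failure mode listed above. The model here is a deliberately naive stand-in, not any real LLM or the paper's evaluation code.

```python
def toy_model(shards, commits_early):
    # Task: pick the smallest number satisfying every constraint.
    # Each shard is a set of allowed numbers; the true answer is the
    # minimum of their intersection.
    allowed = set(range(100))
    answer = None
    for shard in shards:
        allowed &= shard
        if commits_early and answer is None:
            answer = min(allowed)  # locks in after the first shard
    if not commits_early:
        answer = min(allowed)      # waits until all shards are seen
    return answer

# Full instruction, revealed as three shards.
shards = [set(range(0, 50)), set(range(10, 100)), {12, 13, 42}]
target = min(set.intersection(*shards))

single_turn = toy_model(shards, commits_early=False)  # correct: 12
multi_turn = toy_model(shards, commits_early=True)    # premature: 0
```

Even though both runs see exactly the same information, the early-committing run is wrong because it never revisits its first answer — a caricature of how the paper's models compound early mistakes instead of recovering.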

Read the full research paper here: https://arxiv.org/pdf/2505.06120

A Practical Guide to Building Agents by OpenAI

📩 Forward this to people you know who are keeping pace with the fast-changing AI world, and stay tuned for the next edition to stay ahead of the curve!