AI Resources - Part 1

A collection of interesting AI tools, products, resources, papers, and more I’ve come across.

tl;dr sec #225

  • Everyone says they’re an AI startup

  • Often it’s not clear which will win: the weak form and strong form

  • Everyone can integrate into the same APIs, what’s the defensibility over time?

  • There’s a real difference between AI apps and foundational models

  • Lots of growth driven by novelty but what about retention?

  • Hard to pick winners when it’s early (think Flipboard and Foursquare in the mobile wave); better to wait

  • Lack of proven business models

  • Concerns about hype and overvaluation

Matt Shumer: claude-prompt-engineer
(Repo) Just describe a task, and a chain of AIs will:

  • Generate many possible prompts

  • Test them in a ranked tournament

  • Return the best one

  • “Often, they outperform the prompts I'd write by hand (especially when I ask it to generate and compare 10+ prompts).”
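
Not the repo's actual code, but a minimal sketch of the generate-then-tournament loop it describes, using the OpenAI Python client as a stand-in (the repo itself drives Claude); the model name, task, and judging prompt are all placeholders:

```python
import itertools
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    r = client.chat.completions.create(model="gpt-4",  # placeholder model
                                       messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content

def generate_candidates(task: str, n: int = 10) -> list[str]:
    # Ask the model for n different candidate prompts for the task.
    return [llm(f"Write one prompt that would make an LLM do this task well:\n{task}")
            for _ in range(n)]

def tournament(task: str, test_input: str, candidates: list[str]) -> str:
    # Round-robin: run each pair of prompts on the test input, let a judge model
    # pick the better output, and tally wins per prompt.
    wins = {c: 0 for c in candidates}
    for a, b in itertools.combinations(candidates, 2):
        out_a, out_b = llm(f"{a}\n\n{test_input}"), llm(f"{b}\n\n{test_input}")
        verdict = llm(f"Task: {task}\n\nOutput A:\n{out_a}\n\nOutput B:\n{out_b}\n\n"
                      "Which output better accomplishes the task? Answer only A or B.")
        wins[a if verdict.strip().upper().startswith("A") else b] += 1
    return max(wins, key=wins.get)  # the best-ranked prompt

task = "Summarize a security advisory in two sentences"
best_prompt = tournament(task, "ADVISORY TEXT HERE", generate_candidates(task))
```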

Eladlev/AutoPrompt
A prompt optimization framework designed to enhance and perfect your prompts for real-world use cases.

The framework automatically generates high-quality, detailed prompts tailored to user intentions. It employs a refinement (calibration) process, where it iteratively builds a dataset of challenging edge cases and optimizes the prompt accordingly. This approach not only reduces manual effort in prompt engineering but also effectively addresses common issues such as prompt sensitivity and inherent ambiguity.
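
A rough sketch of that calibration loop (not AutoPrompt's actual API): repeatedly synthesize challenging cases, check where the current prompt fails, and rewrite it; the model name and grading prompts are placeholders:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    r = client.chat.completions.create(model="gpt-4",  # placeholder model
                                       messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content

def calibrate(prompt: str, intent: str, rounds: int = 3) -> str:
    edge_cases: list[str] = []
    for _ in range(rounds):
        # Grow a dataset of inputs likely to break the current prompt.
        edge_cases.append(ask(f"Intent: {intent}\nCurrent prompt: {prompt}\n"
                              "Invent one tricky input where this prompt could fail."))
        failures = []
        for case in edge_cases:
            output = ask(prompt + "\n\n" + case)
            verdict = ask(f"Intent: {intent}\nInput: {case}\nOutput: {output}\n"
                          "Does the output satisfy the intent? Reply PASS or FAIL.")
            if "FAIL" in verdict.upper():
                failures.append(case)
        if failures:
            # Optimize the prompt against the observed failures.
            prompt = ask("Rewrite this prompt so it handles the failing inputs below.\n"
                         f"Prompt:\n{prompt}\nFailing inputs:\n" + "\n".join(failures))
    return prompt
```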

Companies

  • CodeRabbit: “AI-first Code Reviewer.” Line-by-line reviews, issue validation, PR summarization, and more.

  • Tusk.ai: AI-created pull requests for annoying tickets. From a Jira ticket or GitHub issue, it’ll automatically change website copy, adjust the UI, and make other small changes for you.

After AI beat them, professional Go players got better and more creative
Go player quality plateaued from the 1950s to the mid-2010s. Within a few years of DeepMind demonstrating AlphaGo in 2016, the weakest professional players were better than the strongest players before AI. The strongest players pushed beyond what had been thought possible.

It wasn’t simply that they imitated the AI, in a mechanical way. They got more creative, too. There was an uptick in historically novel moves and sequences. Shin et al calculate about 40 percent of the improvement came from moves that could have been memorized by studying the AI. But moves that deviated from what the AI would do also improved, and these “human moves” accounted for 60 percent of the improvement.

Something is considered impossible. Then somebody does it. Soon it is standard. This is a common pattern. Until Roger Bannister ran the 4-minute mile, the best runners clustered just above 4 minutes for decades. A few months later Bannister was no longer the only runner to do a 4-minute mile. These days, high schoolers do it.

When Deep Blue beat the chess world champion Kasparov in 1997, it was assumed this would be a blow to human chess players. It wasn’t. Chess became more popular than ever. And the games did not become machine-like and predictable. Instead, top players like Magnus Carlsen became more inventive than ever.

tl;dr sec #224

Anatomy of OpenAI's Developer Community
A Jupyter notebook analyzing a dump of 100K+ posts in the OpenAI Discourse. Core topics: the API, GPT builders, prompting, and more.

Choose Your Weapon: Survival Strategies for Depressed AI Academics
Survival strategies for academics, now that modern AI research requires millions of dollars to train big models:

  • Give up

  • Try scaling anyway

  • Scale down

  • Reuse and remaster

  • Analysis instead of synthesis

  • RL! No Data!

  • Small models! No Compute!

  • Work on specialized application areas or domains

  • Solve problems few care about (for now)

  • Try things that shouldn’t work

  • Do things that have bad optics

  • Start it up; spin it out!

  • Collaborate, or jump ship

meistrari/prompts-royale - Automatically create prompts and make them fight each other to know which is the best.

AgentOps-AI/agentops - Python SDK for agent evals and observability. Build your next agent with benchmarks, observability, and replay analytics. AgentOps is the toolkit for evaluating and developing robust and reliable AI agents.

DAGWorks-Inc/burr
Build applications that make decisions based on state (chatbots, agents, simulations, etc...) from simple Python building blocks. Monitor, persist, and execute on your own infrastructure. Includes a UI that can track/monitor agent decisions in real time.

princeton-nlp/SWE-agent
SWE-agent turns LMs (e.g. GPT-4) into software engineering agents that can fix bugs and issues in real GitHub repositories. On the full SWE-bench test set, SWE-agent resolves 12.29% of issues, achieving state-of-the-art performance.

plandex-ai/plandex
An open source, terminal-based AI coding engine for complex tasks. Plandex uses long-running agents to complete tasks that span multiple files and require many steps. It breaks up large tasks into smaller subtasks, then implements each one, continuing until it finishes the job. It helps you churn through your backlog, work with unfamiliar technologies, get unstuck, and spend less time on the boring stuff.

OpenDevin/OpenDevin
An open-source project aiming to replicate Devin, an autonomous AI software engineer who is capable of executing complex engineering tasks and collaborating actively with users on software development projects.

Sequoia’s AI Ascent 2024 YouTube Playlist
A series of short talks from cool folks. Some of the talks that stood out:

1/ AgentOps: Agents are slow, expensive, and unreliable; AgentOps is fixing that. Track, test, and benchmark AI agents from prototype to production. (@AlexReibman, @AgentOpsAI, @AtomSilverman, @siyangqiu)

2/ Reworkd: AI agents for navigating the web and scraping data. Introducing Tarsier, an open source framework that combines web scraping and OCR to extract text from web pages for consumption by LLMs. (@asimdotshrestha, @khoomeik, @ReworkdAI)

4/ DeepUnit: AI agent for automatically developing unit tests. Give this agent your repo and get complete code coverage over your entire project. (@GPTJustin, @stateof_kate, @DeepUnitAI)

5/ Deepgram: Conversational AI tools for building voice bots and agents. Comes complete with realistic, low-latency voices. (@DeepgramAI)

7/ Composio: Extremely simple integrations and tools for outfitting AI agents. Building an AI agent to handle Linear + GitHub issues in 3 minutes. (@KaranVaidya6)

9/ OpenPipe: Fine-tune LLMs faster and 14x cheaper than OpenAI. Outfit agents with faster, cheaper language models at scale. (@fly_north, @corbtt, @OpenPipeAI)

tl;dr sec #224

  • semanser/codel - Fully autonomous AI Agent that can perform complicated tasks and projects using terminal, browser, and editor.

  • misbahsy/RAGTune - An automated tuning and optimization tool for the RAG (Retrieval-Augmented Generation) pipeline. This tool allows you to evaluate different LLMs (Large Language Models), embedding models, query transformations, and rerankers. Twitter overview.

Introducing DBRX: A New State-of-the-Art Open LLM
From Databricks. “According to our measurements, it surpasses GPT-3.5, and it is competitive with Gemini 1.0 Pro. It is an especially capable code model, surpassing specialized models like CodeLLaMA-70B on programming, in addition to its strength as a general-purpose LLM.”

FlyFlow
“Optimize your LLM usage with 5x faster queries, 3x lower price, and the same quality as GPT4 using fine tuned models on autopilot. One line integration by changing a URL.” Demo page: “Flyflow offers fine tuning as a service. We proxy all of your GPT4 / Claude3 traffic, collect the responses, and use them to fine tune a smaller, faster, and cheaper model that matches GPT4 quality.”

tl;dr sec #222

KhoomeiK/LlamaGym - Fine-tune LLM agents with online reinforcement learning

bananaml/fructose - A Python package to create a dependable, strongly-typed interface around an LLM call. Just slap the @ai() decorator on a type-annotated function and call it as you would a function.
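
Based purely on that description, usage presumably looks something like the sketch below; the import path and constructor name are assumptions (check the repo's README), and only the @ai() decorator pattern comes from the blurb above:

```python
from fructose import Fructose  # assumption: actual import/constructor may differ

ai = Fructose()

@ai()
def rate_severity(vuln_description: str) -> int:
    """Rate the severity of the described vulnerability from 1 (low) to 10 (critical)."""

# Called like a normal function; the decorator handles the LLM call and
# coerces the response into the annotated return type (an int here).
print(rate_severity("SQL injection in the admin login form"))
```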

  • relari-ai/continuous-eval - “Open-Source Evaluation for GenAI Application Pipelines.”

  • Arize - “The AI Observability & LLM Evaluation Platform”

  • Hix.ai - “Bypass AI Detection With Our Undetectable AI Tool”

  • BypassGPT - “100% Undetectable AI to Bypass AI Detection”

Dan Shipper and Dave Clark (film director) video walk-through of creating a movie with AI

tl;dr sec #220

HyperWriteAI Agent Studio - A Chrome extension that lets you record a task which it can then replay.

tl;dr sec #219

  • samber/the-great-gpt-firewall - A curated list by Samuel Berthe of websites that restrict access to AI agents, AI crawlers, and GPTs.

  • Run Llama 2 uncensored locally 

  • AutoFineTune - (thread) Easily fine-tune a small model with synthetically generated data. Generates 100+ synthetic message pairs with a GPT-4 loop and fine-tunes llama-2-7b with Together AI.

  • Gemini 1.5 Announcement - Uses a Mixture-of-Experts architecture and comes with a standard 128,000-token context window, but there’s a limited preview with a context window of up to 1 million tokens (1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code, or over 700,000 words).

    • “In the Needle In A Haystack (NIAH) evaluation, where a small piece of text containing a particular fact or statement is purposely placed within a long block of text, 1.5 Pro found the embedded text 99% of the time, in blocks of data as long as 1 million tokens.”

    • “Gemini 1.5 Pro also shows impressive “in-context learning” skills, meaning that it can learn a new skill from information given in a long prompt, without needing additional fine-tuning. We tested this skill on the Machine Translation from One Book (MTOB) benchmark, which shows how well the model learns from information it’s never seen before.”
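
For illustration, a toy needle-in-a-haystack harness in the spirit of the evaluation quoted above; the needle, filler text, and model are placeholders (Gemini 1.5 Pro was the model under test in the announcement):

```python
from openai import OpenAI

client = OpenAI()

NEEDLE = "The magic number for the deployment is 48151623."
FILLER = "This sentence is filler about nothing in particular. " * 500  # scale up to stress longer contexts

def niah_trial(depth: float) -> bool:
    # Insert the needle at a relative position (0.0 = start, 1.0 = end) of the haystack.
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + "\n" + NEEDLE + "\n" + FILLER[cut:]
    r = client.chat.completions.create(
        model="gpt-4",  # placeholder model
        messages=[{"role": "user",
                   "content": haystack + "\n\nWhat is the magic number for the deployment?"}])
    return "48151623" in r.choices[0].message.content

hits = sum(niah_trial(d / 10) for d in range(11))
print(f"needle recovered at {hits}/11 insertion depths")
```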

The killer app of Gemini Pro 1.5 is video
Simon Willison shares his experience playing around with Gemini Pro 1.5, and how it can take as input a quick video of his bookshelf and return the titles and authors as JSON.

AdGen AI - AI-generated creatives that perform.

tl;dr sec #218

  • traceloop/openllmetry-js - Open-source observability for your LLM application, based on OpenTelemetry.

  • ferrislucas/promptr - A CLI tool that lets you use plain English to instruct GPT-3 or GPT-4 to make changes to your codebase.

  • Deeptechia/geppetto - An advanced Slack bot integrating OpenAI's ChatGPT-4 and DALL-E-3 for interactive AI conversations and image generation. Enhances Slack communication with automated greetings, coherent responses, and creative visualizations.

  • lllyasviel/Fooocus - Image generating software, based on Gradio. Like Stable Diffusion, it’s offline, open source, and free. Like Midjourney, manual tweaking is not needed; users only need to focus on the prompts and images.

The System Prompt for ChatGPT
It’s interesting that it’s mostly just normal English instruction, no crazy prompt engineering.

Better Call GPT, Comparing Large Language Models Against Lawyers
Paper: “Our empirical analysis benchmarks LLMs against a ground truth set by Senior Lawyers, uncovering that advanced models match or exceed human accuracy in determining legal issues. In speed, LLMs complete reviews in mere seconds, eclipsing the hours required by their human counterparts. Cost wise, LLMs operate at a fraction of the price, offering a staggering 99.97 percent reduction in cost over traditional methods.”

tl;dr sec #217

  • screenshot-to-code - Drop in a screenshot and convert it to clean code (HTML/Tailwind/React/Vue).

  • wishful-search - A natural language search module for JSON arrays by Hrishi Olickel. Take any JSON array you have (notifications, movies, flights, people) and filter it with complex questions. WishfulSearch takes care of the prompting, database management, object-to-relational conversion and query formatting.

  • ElevenLabs Speech-to-Speech - Say it how you want it and transform your voice into another character, with full control over emotions, timing, and delivery.

tl;dr sec #216

Why you should invest in AI
Sarah Guo makes the case for why you should invest your time and attention in AI.

The next grand challenge for AI
Jim Fan presents the next grand challenge in the quest for AI: the "foundation agent," which would seamlessly operate across both the virtual and physical worlds.

Enhancing Lecture Notes with AI
A student describes how they record and live transcribe lectures, then pass the transcript to an LLM to get summary notes, in addition to their main hand-written notes.

LangGraph for multi-agent workflows
New functionality from LangChain that makes it easy to construct multi-agent workflows: each node is an agent, and the edges represent how they communicate.

tl;dr sec #215

Image Generation

KillianLucas/aifs
Local semantic search over folders. It will chunk and embed all nested supported files (txt, docx, pptx, jpg, png, eml, html, pdf).
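
Not aifs's API, but the general chunk-embed-search pattern it implements, sketched with OpenAI embeddings and cosine similarity (only .txt files handled here for brevity):

```python
import numpy as np
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    r = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in r.data])

def index_folder(folder: str, chunk_size: int = 800) -> tuple[list[str], np.ndarray]:
    # Chunk every nested text file; aifs also handles docx, pptx, pdf, images, etc.
    chunks: list[str] = []
    for path in Path(folder).rglob("*.txt"):
        text = path.read_text(errors="ignore")
        chunks += [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks, embed(chunks)

def search(query: str, chunks: list[str], vectors: np.ndarray, k: int = 5) -> list[str]:
    q = embed([query])[0]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]  # top-k most similar chunks
```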

  1. Start with the most powerful model for your app’s use case (likely GPT-4). You want the best quality output so you can fine tune a smaller model.

  2. Store your AI requests/responses so they can be easily exported. He uses @helicone_ai, which you can easily swap in for the OpenAI APIs; it stores all of your AI requests in an exportable table.

  3. After you’ve collected ~100-500+ request/response pairs, export them and clean the data so that the inputs and outputs are of high quality. You can also leverage feedback from users (e.g. thumbs up/thumbs down) if you have it.

  4. With the clean dataset, use a hosted OSS AI service like Together or Anyscale to fine-tune Mixtral 8x7B. He’s gotten better results with these than fine tuning GPT-3.5-Turbo on OpenAI.

  5. Swap out GPT-4 with the fine-tuned model.
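
As a concrete illustration of steps 2-4, a rough sketch that turns exported request/response pairs into the chat-format JSONL most hosted fine-tuning services (Together, Anyscale, OpenAI) accept; the export's field names here are assumptions:

```python
import json

def to_finetune_jsonl(pairs: list[dict], out_path: str, min_len: int = 20) -> int:
    """Convert exported {prompt, response, feedback} records into chat-format JSONL."""
    kept = 0
    with open(out_path, "w") as f:
        for p in pairs:
            prompt = p.get("prompt", "").strip()
            response = p.get("response", "").strip()
            # Light cleaning: drop empty/short rows and anything users thumbed down.
            if len(response) < min_len or p.get("feedback") == "thumbs_down":
                continue
            f.write(json.dumps({"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]}) + "\n")
            kept += 1
    return kept

# e.g. pairs = json.load(open("exported_requests.json"))
#      to_finetune_jsonl(pairs, "train.jsonl")
```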

tl;dr sec #214

Products

  • EverArt - Train on your style, and then generate marketing assets, photoshoots, visualize new products, packaging, and more.

  • Monica.im - All-in-one AI Assistant.

tl;dr sec #213

The “Lever” prompting technique
From The Prompt Warrior: Whenever ChatGPT goes 'too far' or 'not far enough' with something, for example:

  • Tone too formal

  • Summarization too brief

  • Brainstorming not creative enough

Just do this:

  1. Ask it to rate the output on a scale of 1-10 (define 1 and 10)

  2. Then adjust to your desired number

On a scale of 1-10.

If 1 is a ...
And 10 is a ...

How would you rate this []?
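
In code, the two-step lever might look like the sketch below, using the OpenAI chat API as a stand-in for ChatGPT; the example task and scale definitions are placeholders:

```python
from openai import OpenAI

client = OpenAI()

def chat(messages: list[dict]) -> str:
    r = client.chat.completions.create(model="gpt-4", messages=messages)  # placeholder model
    return r.choices[0].message.content

history = [{"role": "user", "content": "Summarize this incident report: ..."}]
history.append({"role": "assistant", "content": chat(history)})

# Step 1: ask for a rating on a defined 1-10 scale.
history.append({"role": "user", "content":
    "On a scale of 1-10, if 1 is a one-line summary and 10 is an exhaustive "
    "section-by-section summary, how would you rate this summary?"})
history.append({"role": "assistant", "content": chat(history)})

# Step 2: move the lever to the number you actually want.
history.append({"role": "user", "content": "Rewrite it as a 7."})
print(chat(history))
```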

tl;dr sec #212

ByteDance announces StemGen: A music generation model that listens
“Most models concentrate on generating fully mixed music in response to abstract conditioning information. In this work, we present an alternative paradigm for producing music generation models that can listen and respond to musical context.” (paper)

OpenAI’s Official Prompt Engineering Guide
Six strategies, with specific tactics under each, for getting better results (a small example combining two of the tactics follows the list):

  • Write clear instructions

    • Include details in your query to get more relevant answers

    • Ask the model to adopt a persona

    • Use delimiters to clearly indicate distinct parts of the input

    • Specify the steps required to complete a task

    • Provide examples

    • Specify the desired length of the output

  • Provide reference text

    • Instruct the model to answer using a reference text

    • Instruct the model to answer with citations from a reference text

  • Split complex tasks into simpler subtasks

    • Use intent classification to identify the most relevant instructions for a user query

    • For dialogue applications that require very long conversations, summarize or filter previous dialogue

    • Summarize long documents piecewise and construct a full summary recursively

  • Give the model time to "think"

    • Instruct the model to work out its own solution before rushing to a conclusion

    • Use inner monologue or a sequence of queries to hide the model's reasoning process

    • Ask the model if it missed anything on previous passes

  • Use external tools

    • Use embeddings-based search to implement efficient knowledge retrieval

    • Use code execution to perform more accurate calculations or call external APIs

    • Give the model access to specific functions

  • Test changes systematically

    • Evaluate model outputs with reference to gold-standard answers
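
For instance, the "use delimiters" and "answer from reference text" tactics combine naturally into one prompt; a small sketch (the reference text, question, and model are placeholders):

```python
from openai import OpenAI

client = OpenAI()

reference = "...paste the reference document here..."
question = "What mitigations does the advisory recommend?"

r = client.chat.completions.create(
    model="gpt-4",  # placeholder model
    messages=[
        {"role": "system",
         "content": "Answer using only the text delimited by triple quotes. "
                    "If the answer is not in the text, say you cannot find it, "
                    "and cite the passage you relied on."},
        {"role": "user", "content": f'"""{reference}"""\n\nQuestion: {question}'},
    ],
)
print(r.choices[0].message.content)
```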

Agents

  • Autogen - OSS multi-agent conversation framework by Microsoft. Has some neat examples on their blog.

  • crewai - An OSS framework for orchestrating role-playing, autonomous agents.

  • E2B - Secure sandboxed cloud environments made for AI agents and AI apps. They’ve open sourced most of the underlying code.

  • Steamship - “The development platform for AI Agents.” Build AI Agents with their Python SDK, and effortlessly deploy them to the cloud. Gain access to serverless cloud hosting, vector search, webhooks, callbacks, and more.

  • Lindy.ai - “Meet your AI employee.” A no-code product aiming to make it easy to create a team of various AI agents using only English description of how they should behave (their prompt).

    • Without yet looking into it deeply, what seems to differentiate Lindy from the other agent platforms is that it appears aimed at non-developer audiences, and it seems to focus on having many integrations, like Zapier, that make it easy to have agents interact with your calendar, email, GitHub, or whatever other systems you’re using.

  • Relevance AI - No code “build your AI workforce” platform.

  • AgentGPT - An autonomous AI Agent platform that empowers users to create and deploy customizable autonomous AI agents directly in the browser.

  • AgentRunner - “Create autonomous AI agents.”

  • research-agents-3.0 - Repo demonstrating Autogen + GPTs to build a group of AI researchers.

The State of AI Agents
Great roundup by the E2B folks on products built on top of agents, their challenges, standardization, and more, with some useful overview diagrams of many players in the space.

TIL about: “The Agent Protocol, adopted in the AutoGPT benchmarks, is a tech stack agnostic way to standardize and hence benchmark and compare AI agents.”

tl;dr sec #211

How to tackle unreliability of coding assistants
Thoughtworks’ Birgitta Böckeler shares some useful questions to ask yourself and perspective on how to think about coding with LLMs:

  • Do I have a quick feedback loop?

    • Can you verify quickly if the LLM output is correct or if it’s wasting your time?

    • Syntax highlighting, tests, run and observe behavior.

  • Do I have a reliable feedback loop?

  • What is the margin of error?

  • Do I need very recent info?

“If the AI assistants are unreliable, then why would I use them in the first place?” There is a mindset shift we have to make when using Generative AI tools in general. We cannot use them with the same expectations we have for “regular” software. GitHub Copilot is not a traditional code generator that gives you 100% of what you need. But in 40-60% of situations, it can get you 40-80% of the way there, which is still useful. When you adjust these expectations, and give yourself some time to understand the behaviours and quirks of the eager donkey, you’ll get more out of AI coding assistants.

tl;dr sec #210

  • Giskard - The testing framework for ML models. See also promptfoo.

  • HeyGen - AI-powered video creations at scale. New features: instant avatar (create an AI version of yourself), and translate you speaking in videos to another language.

  • Meet Aitana: The first Spanish AI model earning up to $11K/month. The thread includes some links to useful tutorials and guides.

  • Noiselith: Desktop app for Stable Diffusion XL so you can easily run it locally, offline.

  • AutoGen's TeachableAgent: New Autogen blog post that includes examples. TeachableAgent uses TextAnalyzerAgent so that users can teach their LLM-based assistants new facts, preferences, and skills.

tl;dr sec #209

Quicklinks

GPTs

AI + Music, Images, or Video

  • Scribble Diffusion: Turn your sketch into a refined image using AI

  • Dall-E Party: Recursively generate an image with DALL-E 3, describe it with GPT4 Vision, use that description with DALL-E 3, …

  • People think white AI-generated faces are more real than actual photos, study says - Attractiveness and "averageness" of AI-generated faces made them seem more real to the study participants, while the large variety of proportions in actual faces seemed unreal.

  • Frigate: Monitor your security cameras with locally processed AI.

  • Script that takes pics using your webcam and describes you like David Attenborough using GPT-4 Vision and ElevenLabs. Worth watching the demo video.

  • Introducing Stable Video Diffusion - The first foundation model for generative video based on the image model Stable Diffusion.

  • Meta brings us closer to AI-generated movies: Given a caption, image, or a photo paired with a description, Emu Video can generate a 4-second animated clip. A complementary tool can then edit those clips using natural language, e.g. “the same clip, but in slow motion.”

  • New music model from Google DeepMind: “With our music AI tools, users can create new music or instrumental sections from scratch, transform audio from one music style or instrument to another, and create instrumental and vocal accompaniments.” A limited set of creators will also be able to generate a unique soundtrack in the voice and style of participating artists like Charlie Puth, Demi Lovato, Sia, T-Pain, and more.

LLMs cannot find reasoning errors, but can correct them!
Paper in which the authors break down the self-correction process into two core components: mistake finding and output correction. They find that LLMs generally struggle with finding logical mistakes, but for output correction, they propose a backtracking method which provides large improvements when given information on mistake location.
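
A sketch of the backtracking idea as described (not the paper's code): keep the chain-of-thought steps before the known mistake location and regenerate from there; the model name and prompt wording are placeholders:

```python
from openai import OpenAI

client = OpenAI()

def backtrack_and_correct(question: str, steps: list[str], mistake_idx: int) -> list[str]:
    """Discard reasoning from the first known-bad step onward and regenerate it."""
    good_prefix = steps[:mistake_idx]
    prompt = (f"Question: {question}\n"
              "Reasoning so far (correct):\n" + "\n".join(good_prefix) + "\n"
              "The next step previously contained a mistake. Continue the reasoning "
              "correctly from this point and finish with the final answer.")
    r = client.chat.completions.create(model="gpt-4",  # placeholder model
                                       messages=[{"role": "user", "content": prompt}])
    return good_prefix + [r.choices[0].message.content]
```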

Outset is using GPT-4 to make user surveys better
YC-backed Outset uses GPT-4 to autonomously conduct and synthesize user surveys. Outset users create a survey and share the link with prospective survey takers, then Outset follows up with respondents to clarify, probe on answers and create a “conversational rapport” for deeper responses. Outset enabled WeightWatchers to conduct and synthesize over 100 interviews in 24 hours.

OpenAI Drama

AI Explained had a nice series of videos about it.

Altman’s polarizing past hints at OpenAI board’s reason for firing him
Previously Y Combinator founder Paul Graham gave Sam the boot from leading YC. Sam “had developed a reputation for favoring personal priorities over official duties and for an absenteeism that rankled his peers and some of the start-ups he was supposed to nurture.”

Re: the new OpenAI board: “Altman was unwilling to talk to anyone he didn’t already know. By Sunday, it became clear that Altman wanted a board composed of a majority of people who would let him get his way.”

“One person who has worked closely with Altman described a pattern of consistent and subtle manipulation that sows division between individuals.”

“A former OpenAI employee, machine learning researcher Geoffrey Irving, who now works at competitor Google DeepMind, wrote that he was disinclined to support Altman after working for him for two years. “1. He was always nice to me. 2. He lied to me on various occasions 3. He was deceptive, manipulative, and worse to others, including my close friends (again, only nice to me, for reasons).””

Exclusive: OpenAI researchers warned board of AI breakthrough ahead of CEO ouster, sources say
Supposedly several staff researchers at OpenAI wrote a letter to the board of directors warning of a powerful AI discovery that could threaten humanity. Allegedly there was a project, Q*, that was able to solve certain math problems, implying it might have greater reasoning capabilities than just predicting the next word. This could be applied to novel scientific research, for instance.

This may have been what Sam Altman meant when he talked about being in the room “where we push the veil of ignorance back and the frontier of discovery forward.”

OpenAI’s Misalignment and Microsoft’s Gain
Stratechery deep dive on the implications of OpenAI’s non-profit model and governance situation, internal cultural dynamics at OpenAI, Microsoft’s role, Altman’s reputation, and thoughts going forward.