Show HN: OpenCastor Agent Harness Evaluator Leaderboard
3 by craigm26 | 0 comments on Hacker News.
I've been building OpenCastor, a runtime layer that sits between a robot's hardware and its AI agent. One thing that surprised me: the order you arrange the skill pipeline (context builder → model router → error handler, etc.) and parameters like thinking_budget and context_budget affect task success rates as much as model choice does.

So I built a distributed evaluator. Robots contribute idle compute to benchmark harness configurations against OHB-1, a small benchmark of 30 real-world robot tasks (grip, navigate, respond, etc.) using local LLM calls via Ollama. The search space is 263,424 configs across 8 dimensions (model routing, context budget, retry logic, drift detection, etc.).

The demo leaderboard shows results so far, broken down by hardware tier (Pi5+Hailo, Jetson, server, budget boards). The current champion config is free to download as a YAML and apply to any robot. P66 safety parameters are stripped on apply: no harness config can touch motor limits or ESTOP logic.

Looking for feedback on: (1) whether the benchmark tasks are representative, (2) whether the hardware tier breakdown is useful, and (3) anyone who's run fleet-wide distributed evals of agent configs, for robotics or otherwise.
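To make the size of the grid concrete, here is a minimal sketch of an 8-dimension config space. The dimension names come from the post, but the per-dimension sizes are hypothetical, chosen only so the product matches the stated 263,424; the real OpenCastor grid may split differently.

```python
from itertools import product

# Hypothetical sizes per dimension; only the product (263,424) is from the post.
dimensions = {
    "model_routing": 4,
    "context_budget": 7,
    "thinking_budget": 7,
    "retry_logic": 4,
    "drift_detection": 3,
    "pipeline_order": 7,
    "error_handling": 4,
    "context_builder": 4,
}

total = 1
for size in dimensions.values():
    total *= size
print(total)  # 263424

# Enumerating lazily keeps memory flat even though the full grid is large,
# which matters when each robot only benchmarks one config at a time.
configs = product(*(range(n) for n in dimensions.values()))
first = next(configs)
print(first)  # (0, 0, 0, 0, 0, 0, 0, 0)
```

Exhaustively benchmarking every cell is infeasible on real hardware, which is presumably why distributing the evaluation across idle robots helps.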
Hack Nux
Watch websites being hacked today, counted one by one on a page, increasing in real time.
New Show Hacker News story: Show HN: Cq – Stack Overflow for AI coding agents
Show HN: Cq – Stack Overflow for AI coding agents
21 by peteski22 | 4 comments on Hacker News.
Hi all, I'm Peter, a Staff Engineer at Mozilla.ai, and I want to share our idea for a standard for shared agent learning; conceptually it fit easily into my mental model as a Stack Overflow for agents. The project is trying to see if we can get agents (any agent, any model) to propose 'knowledge units' (KUs) in a standard schema, based on gotchas they run into during use, and to proactively query for existing KUs to get insights which they can verify and confirm if they prove useful.

It's currently very much a PoC, with a more lofty proposal in the repo; we're trying to iterate from local use, up to team level, and ideally eventually to some kind of public commons. At the team level (see our Docker Compose example) you run the team API and configure your coding agent to point at its address, so KUs are sent there instead, where a human in the loop (HITL) can review them via a UI in the browser before they're allowed to appear in queries by other agents on your team.

We're learning a lot even from using it locally on various repos internally, not just in the kind of KUs it generates, but also, from a UX perspective, in trying to make it easy to start using and to approve KUs in the browser dashboard. There are bigger, complex problems to solve in the future around data privacy, governance, etc., but for now we're super focused on getting something that people can see value from really quickly in their day-to-day.
Tech stack:
* Skills - markdown
* Local Python MCP server (FastMCP) managing a local SQLite knowledge store
* Optional team API (FastAPI, Docker) for sharing knowledge across an org
* Installs as a Claude Code plugin or OpenCode MCP server
* Local-first by default; your knowledge stays on your machine unless you opt into team sync by setting the address in config
* OSS (Apache 2.0 licensed)

Here's an example of something that seemed straightforward: when asked to write a GitHub Action, Claude Code often used actions that were multiple major versions out of date because of its training data. In this case I told the agent what I saw when I reviewed the GitHub Action YAML file it created, and it proposed a knowledge unit to be persisted. Next time, in a completely different repo using OpenCode and an OpenAI model, the cq skill was used up front before the task started; the agent got the gotcha about major versions in training data, checked GitHub proactively, and used the correct, latest major versions. It then confirmed the KU, increasing its confidence score.

I guess some folks might say: well, there's a CLAUDE.md in your repo, or in ~/.claude/. But we're looking further than that: we want this to be available to all agents and all models, and, maybe more importantly, we don't want to stuff AGENTS.md or CLAUDE.md with loads of rules that lead to unpredictable behaviour. This is targeted information on a particular task, which seems a lot more useful.

Right now it can be installed locally as a plugin for Claude Code and OpenCode:

claude plugin marketplace add mozilla-ai/cq
claude plugin install cq

This allows you to capture data in your local ~/.cq/local.db (the data doesn't get sent anywhere else). We'd love feedback on this; the repo is open and public, so GitHub issues are welcome.
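The flow above — a KU proposed from a gotcha, persisted locally in SQLite, then queried later — can be sketched roughly as follows. The field names and table layout are illustrative guesses, not the actual cq schema.

```python
import json
import sqlite3

# Hypothetical shape of a "knowledge unit"; field names are illustrative.
ku = {
    "id": "ku-001",
    "title": "GitHub Actions versions in training data are stale",
    "gotcha": "Models often emit actions pinned to old major versions.",
    "fix": "Check the action's repo for the latest major version before pinning.",
    "confidence": 0.6,
    "confirmations": 1,
}

# Stand-in for the local knowledge store (~/.cq/local.db in the real tool).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kus (id TEXT PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO kus VALUES (?, ?)", (ku["id"], json.dumps(ku)))

# A later agent session queries for existing KUs before starting the task.
row = conn.execute("SELECT body FROM kus WHERE id = ?", ("ku-001",)).fetchone()
found = json.loads(row[0])
print(found["title"])
```

A confirming agent would then bump `confirmations` and `confidence` in place, which is the feedback loop the post describes.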
We've posted on some of our social media platforms with a link to the blog post (below), so feel free to reply to us if you found it useful or ran into friction; we want to make this something that's accessible to everyone. Blog post with the full story: https://ift.tt/BpK0C9W GitHub repo: https://ift.tt/xKB2Fjr Thanks again for your time.
New ask Hacker News story: Ask HN: Is using AI tooling for a PhD literature review dishonest?
Ask HN: Is using AI tooling for a PhD literature review dishonest?
4 by latand6 | 1 comments on Hacker News.
I'm a PhD student in structural engineering. My dissertation topic is about using LLM agents to automate FEA calculations in the software Ukrainian companies commonly use. I'm writing my literature review now, and I've vibecoded a personal local dashboard that helps me manage the process. I use LLM agents to fill in the LaTeX template in a GitHub repo (to automate formatting; an IDE also lets you view diffs). Then I run ChatGPT Pro to collect all the papers relevant to my topic (and how they're relevant), and I gather the ones whose PDFs are available online. I keep a structured folder tree of plain files like markdown and JSON.

The idea of the dashboard is the following: I run Codex through a web chat to identify quotes relevant to my dissertation topic, and how they are relevant; it combines them into a number of claims, each connected to a quote with a link. Then I review each quote and each claim manually and tick the boxes. There is also a button that runs a verification script, which validates that the exact quote really IS in the PDF. This way I can collect real evidence and pick up new insights while reading. I remember doing all this manually during my master's degree in the UK; that was a terrible, tedious experience, partly because I have ADHD.

So my question is: is it dishonest? I can defend every claim in the review, because I built the verification pipeline and reviewed each item manually. I arguably understand the literature better than if I had read and highlighted it all myself. But I know that many universities would consider any AI-generated text academic misconduct, and I don't quite understand the principle behind this position. If you outsource proofreading, nobody cares. When you use Grammarly, same thing. But if I use an LLM to create text from verified, structured, human-reviewed evidence, it might be considered dishonest.
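The quote-verification step could look something like the sketch below. PDF text extraction mangles whitespace and line breaks, so both sides are normalized before the substring check; a real pipeline would first produce `extracted_text` with a PDF library such as pdfminer, a step omitted here, and the example text is made up.

```python
import re

def quote_in_text(quote: str, extracted_text: str) -> bool:
    """Check that a claimed quote really appears in a paper's extracted text.

    Whitespace and case are normalized on both sides, since PDF extraction
    scatters line breaks and runs of spaces through the text.
    """
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(quote) in norm(extracted_text)

# Illustrative stand-in for text extracted from one PDF page.
page = "Finite element analysis\n(FEA)   is widely used\nin structural design."

print(quote_in_text("FEA) is widely used", page))   # True
print(quote_in_text("FEA is rarely used", page))    # False
```

Any quote the check rejects gets flagged for manual review rather than silently included, which is what makes each claim defensible.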
New Show Hacker News story: Show HN: Foundations of Music (FoM)
Show HN: Foundations of Music (FoM)
2 by ersinesen | 0 comments on Hacker News.
Foundations of Music is an attempt to establish a conceptual and formal foundation for understanding music. Rather than declaring what music is, FoM shows where and how music becomes possible. It provides simple explanations of complex concepts like vibrato, glissando, and portamento for outsiders. It enables new vocabulary, like jazzing, jazzing around, jazzing along, and jazz translation, which is mind-refreshing, at least to me. For a sample translation (Turkish Folk to Blues), see: https://www.youtube.com/watch?v=Ml4pEk2hMM8 The proposed concept of perceptual fatigue may prove highly controversial, but I think it is inspiring food for thought. In the end, FoM is a work in progress toward a stable ground from which new musical questions can be meaningfully explored.
New ask Hacker News story: Ask HN: How do you handle peer-to-peer discovery on iOS without a server?
Ask HN: How do you handle peer-to-peer discovery on iOS without a server?
5 by redgridtactical | 5 comments on Hacker News.
I'm building an app that syncs between phones over Bluetooth when there's no cell service. Android has the Nearby Connections API, which handles discovery and transport nicely. iOS has Multipeer Connectivity, but it's flaky and Apple hasn't updated it in years. CoreBluetooth works, but discovery is slow and you're limited to roughly 28 bytes of advertising payload. Has anyone found a reliable cross-platform approach to BLE device discovery that doesn't require a central server or pre-shared identifiers?
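For scale, here is one hypothetical way to fit a peer identity into that ~28-byte budget: a 16-byte service UUID used for discovery filtering, plus a truncated hash of the device's public identity key as a peer tag. The UUID and the layout are invented for illustration, and iOS in particular restricts what raw bytes an app may advertise, so treat this as a byte-budget sketch, not a CoreBluetooth recipe.

```python
import hashlib
import uuid

# Made-up service UUID; both platforms would scan/filter on this.
SERVICE_UUID = uuid.UUID("0000feed-0000-1000-8000-00805f9b34fb")

def build_advertisement(identity_key: bytes) -> bytes:
    """Pack a discovery payload: service UUID + truncated identity hash."""
    peer_tag = hashlib.sha256(identity_key).digest()[:8]  # 8-byte peer tag
    return SERVICE_UUID.bytes + peer_tag                  # 16 + 8 = 24 bytes

adv = build_advertisement(b"example public key")
print(len(adv))  # 24, inside the ~28-byte advertising budget
```

Once two peers see each other's tags, they can open a GATT connection and exchange full identities there, where the size limit no longer applies.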
New Show Hacker News story: Show HN: An event loop for asyncio written in Rust
Show HN: An event loop for asyncio written in Rust
4 by yehors | 0 comments on Hacker News.
Actually, nothing special about this implementation: just another event loop written in Rust, for educational purposes and joy. In tests it shows seamless migration from uvloop for my scraping framework https://ift.tt/sIeulER . With APIs (FastAPI) it shows only one advantage: better p99; uvloop is faster by about 10-20% in the synthetic run. Currently I'm working on the win branch to give it Windows support, which uvloop lacks.
New Show Hacker News story: Show HN: Travel Hacking Toolkit – Points search and trip planning with AI
Show HN: Travel Hacking Toolkit – Points search and trip planning with AI
2 by borski | 0 comments on Hacker News.
I use points and miles for most of my travel. Every booking comes down to the same decision: use points or pay cash? To answer that, you need award availability across multiple programs, cash prices, your current balances, transfer partner ratios, and the math to compare them. I got tired of doing it manually across a dozen tabs.

This toolkit teaches Claude Code and OpenCode how to do it: 7 skills (markdown files with API docs and curl examples) and 6 MCP servers (real-time tools the AI calls directly). It searches award flights across 25+ mileage programs (Seats.aero), compares cash prices (Google Flights, Skiplagged, Kiwi.com, Duffel), pulls your loyalty balances (AwardWallet), searches hotels (Trivago, LiteAPI, Airbnb, Booking.com), finds ferry routes across 33 countries, and looks up weird hidden gems near your destination (Atlas Obscura).

Reference data is included: transfer partner ratios for Chase UR, Amex MR, Bilt, Capital One, and Citi TY; point valuations sourced from TPG, Upgraded Points, OMAAT, and View From The Wing; alliance membership, sweet-spot redemptions, booking windows, and hotel chain brand lookups.

5 of the 6 MCP servers need zero API keys. Clone, run setup.sh, start searching. Skills are, as usual, plain markdown. They work in OpenCode and Claude Code automatically (I added a tiny setup script), and they'll work in anything else that supports skills. PRs welcome! Help me expand the toolkit! :) https://ift.tt/SV65cT0
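The core points-vs-cash math the toolkit automates is a cents-per-point comparison, sketched minimally below. The valuation and the example fares are illustrative numbers, not the toolkit's actual reference data.

```python
def redemption_value_cents(cash_price_usd: float, fees_usd: float,
                           points_required: int) -> float:
    """Cents per point earned by redeeming instead of paying cash.

    Fees paid on the award booking are subtracted, since you pay them
    either way the cash portion of the redemption.
    """
    return (cash_price_usd - fees_usd) * 100 / points_required

# Assumed baseline valuation for this program, in cents per point.
baseline_cpp = 1.5

# Illustrative award: $620 cash fare vs 30,000 points + $20 in fees.
cpp = redemption_value_cents(cash_price_usd=620.0, fees_usd=20.0,
                             points_required=30_000)
print(f"{cpp:.1f} cpp")  # 2.0 cpp
print("use points" if cpp > baseline_cpp else "pay cash")  # use points
```

Layer on transfer partner ratios and live balances, and you get the full decision the post describes, which is exactly the tab-juggling the toolkit replaces.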