Sage Elliott

How LLMs Are Transforming Computer Vision

Voxel51 is the data-centric machine learning software company behind FiftyOne, an open-source toolkit for building computer vision workflows. In this fireside chat, we talk with Voxel51 ML engineer Jacob Marks about how Large Language Models (LLMs) are transforming computer vision.

This Union AI Fireside Chat Covers

  • Introduction to Jacob Marks, ML engineer at Voxel51
  • What are LLMs and computer vision models?
  • How are LLMs transforming computer vision?
  • How is Voxel51 using LLMs and computer vision?
  • What is Multimodal RAG (Retrieval Augmented Generation)?
  • A development stack for combining LLMs and computer vision
  • What is Voxel51?
  • What's the next exciting thing in AI?
  • Is the computer vision hype already here?
  • Upcoming AI work 

You can connect with Jacob Marks and Sage Elliott on LinkedIn.

👋 Say hello to other AI and MLOps practitioners in the Union/Flyte community.

⭐ Check out Flyte, the open-source ML orchestrator on GitHub.

Full AI interview transcript of ‘How LLMs Are Transforming Computer Vision’ 

(This transcript has been edited for clarity and length.) 

Sage Elliott:

Jacob, could you provide an introduction and let us know a bit more about your background and a little bit about what you're currently working on? (We will probably dive way deeper into that later.)

Jacob Marks:

Thank you for having me, Sage. It's great to be here, and I'm very excited about what you and the Union team are working on. I know that so many of our users, open-source and otherwise, are really huge fans of Flyte and the other things you guys are working on.

I'm a machine learning engineer and developer evangelist at Voxel51, and I'll give a little bit more of a background on Voxel51 later on.

I came to Voxel51 and machine learning by way of physics. So my background is actually primarily in physics. I got into machine learning through kind of an unusual route, although nowadays there are a lot of people who go from physics to machine learning. But I've primarily worked on the computer vision side, and that was my entry point to language models.

The way that I approach them is really driven by applications in computer vision and enabling these models to see and understand the world rather than focusing on a chatbot first approach.

Voxel51 is the company behind FiftyOne, the open source toolkit that enables you to build better computer vision workflows by improving the quality of your datasets and delivering insights about your models.

Sage Elliott:

What got you first into computer vision? Was that just something you stumbled upon while you were learning or were you looking to build something specific?

Jacob Marks:

During college and right after, I did a little stint as an imaging engineer: a computer vision position at a medical imaging company. It was using mostly classical techniques — not a lot of machine learning, mostly edge detection, blurring and things like that. But one of the things that stood out to me from that experience is, there's just so much work that can be done in processing medical images, wildlife conservation images and using computer vision in order to make a positive impact in the world.

The problems posed in computer vision are often very geometric and spatial, and that really jibes well with my physics underpinnings. So it really hit home in a way I was not expecting; it resonated with me a lot. That has kept me excited and has directed my intuition about which problems to work on and which things are cool.

Sage Elliott:

I love the computer vision industry. It's the thing that got me into machine learning as well. And I just loved all the stuff you could do with it, like in health care and agriculture and all these different industries that people don't always think of. I found it really exciting. Just like you said, you can make a positive impact with that technology.

What are LLMs and computer vision models?

Sage Elliott:

So our main topic today is how Large Language Models are transforming computer vision. Before we dive really deep, let's do a high-level recap. I'm assuming most people here know what an LLM is but probably have less of an idea of what computer vision is. But just to make sure everyone is on the same page, maybe we could recap.

Jacob Marks:

For sure. One caveat: A lot of this is changing, especially on the LLM side. Some people consider 100 million parameters "large"; some people consider 7 billion to be the cutoff for what is actually large. Some people talk about things like PaLM 2, which is 340 billion parameters, as being one of the large models. But it's pretty flexible depending on the application. The general idea is that these are very large language models with a lot of representational power thanks to their parameter count.

There's a lot of hidden information nestled within those parameters. There's a lot you can do to fine-tune them, and a lot that can be learned if you train them on specialized data. A model's usefulness depends on the data it was trained on and the specific objectives of the training.

It can be useful for things like being a chatbot in English or some other language, being an assistant, doing code completion, or a variety of those tasks. And today, one of the big things, and part of the transformation I'm going to talk about, is multimodal large language models, which we'll get to soon.

Computer vision has historically been a very distinct discipline from natural language processing; LLMs are an offshoot, or in some regards an extension, of NLP.

Computer vision is a field concerned with understanding how computers perceive the world, via things like classification: if you have an image and you want to classify it as a dog or cat, or find where in the image a dog is or where in the image a cat is.

If you want to do things at the level of the individual pixels in the image, you can use segmentation tasks. So not just saying, "OK, this is the box that contains the dog or the cat," but, "These are the specific pixels that are dog, and these are the specific pixels that are not dog," doing that on a class-by-class and pixel-by-pixel basis.

Those things are cool computer vision use cases. But the real value there [comes] once you've processed these images or videos or point clouds, which are the data processed from the representation of light (and sometimes radar for things like self-driving car applications). After doing these computer vision tasks, you can take those insights and use them to make decisions in the real world about what to do: whether to turn left or turn right, whether somebody has a tumor or not, and things like that.

How are LLMs transforming computer vision?

Sage Elliott:

Let's dive into how LLMs are transforming computer vision. What are some of the transformations we've seen happen in the computer vision space with the rise of LLMs, and where do you think that's going? 

Jacob Marks:

So again, a huge caveat: This is all changing really quickly. Tomorrow there could be something completely different that comes out. And in fact, today there was something completely different that came out: Gemini from Google, which is doing well on one of the main benchmarks, the Massive Multitask Language Understanding (MMLU) benchmark, which is academically how people judge the performance of a lot of these models.

This Gemini model (the large one, which is going to be released next year) hit state of the art on 30 out of 32 tests. So it actually beat human-level performance on a lot of things, which is somewhat unprecedented.

So things are changing really fast. But in general, the transformations that we're seeing and the way that I like to think about them and categorize them are in two buckets.

One is using language models as delegators and orchestrators (in a similar vein to what you guys do at Union AI), as dispatchers and as executive engines for a lot of other specialized models: vision models, audio models and others. So it's kind of like a centralized brain that decides what plan to follow and how to compose things together.

This can be [determining] how to write a program that calls models or [deciding] the order in which to do things. We can talk more about specific projects that are doing these things. 

And the other area is in enhanced flexibility for a lot of vision tasks. So there are two examples we can talk about there that I think are super cool.

One is using things like GPT-4 with vision or LLaVA models to basically take the natural language understanding and open-world knowledge of the model and combine it with visual capabilities to do things that are just super cool: for instance, using GPT-4 with vision for sketch-to-HTML code generation without having to fine-tune or do a lot of prompt engineering.

In the other category, there are things like OtterHD. These models are able to combine natural language and computer vision understanding with the flexibility to perceive and understand really high-resolution images in a way that has never been done before.

Sage Elliott:

I think those are all really cool use cases. Is there one that you particularly find more interesting or that you're really excited about out of the things that you mentioned?

Jacob Marks:

From an academic perspective, we've been seeing these dispatchers and delegators for the past five or six months now. Some examples are ViperGPT, HuggingGPT and VISPROG. Some of these are papers that have been at CVPR, the Conference on Computer Vision and Pattern Recognition; ICCV, the International Conference on Computer Vision; and NeurIPS, the premier conference for most machine learning work.

A lot of these projects are taking the idea of using the LLM as the thinking engine, like the brain, and then deciding what models need to be called, so you can split up the reasoning from the specialized domain or model application.

Take tasks like visual question answering, where you ask a question like, "What's going on in the image?" "How many people are in the image?" "What is the expression on the person's face?" or "How many muffins does each kid have to get for it to be fair?"

Or you can get arbitrarily complex with the questions that you're asking. And this is more complicated than captioning. Captioning is basically asking one question, but visual question answering could address any question effectively. Then you get to more complicated multimodal tasks, like grounding tasks.

Oftentimes in the past, when people asked questions specifically about the locations of things within images, or about the relationships between objects, and so on and so forth, they tried to train, end to end, a single model that would do the reasoning and the vision work internally, then just put out an answer for you. It would decide on the answer by passing through all the neurons in the different layers.

But the approach of a lot of these new techniques is essentially to say, "OK, instead of trying to train one end-to-end model that is good at reasoning and vision (or audio, or whatever the other specific modality is), why don't we decompose the problem into the reasoning part of it and the vision part of it?" and write a little pseudocode program, or in some cases an actual program that compiles and is executed, in order to split up the reasoning.

So we can say, "OK, we need to do an object detection task for people and an object detection task for muffins. And then we need to use a mathematical library," whether it's just NumPy or you actually need to use Wolfram Mathematica or something else. Then we combine the answers and pass them into our reasoning- and language-based approach so we can respond to the user in natural language with all the information we've gained.
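As a toy illustration of that decomposition, here is a minimal Python sketch. Everything in it is a stand-in: the detector is a stub returning canned boxes, and in a real system the plan itself (which models to call, and in what order) would be written by the LLM rather than hard-coded.

```python
# Hypothetical sketch of the "LLM as dispatcher" idea: specialized vision
# calls for perception, plain math for reasoning, language for the answer.
# detect() is a stub; a real system would call an object detection model.

def detect(image, label):
    # Stub detector: returns one bounding box per detected object.
    canned = {
        "person": [(10, 10, 50, 80), (60, 10, 100, 80)],
        "muffin": [(5, 5, 15, 15)] * 6,
    }
    return canned.get(label, [])

def fair_share_answer(image):
    # Step 1: the "program" the planner would write: two detection calls.
    kids = detect(image, "person")
    muffins = detect(image, "muffin")
    # Step 2: the math-library step (here, plain arithmetic).
    share = len(muffins) / len(kids) if kids else 0.0
    # Step 3: compose a natural-language answer from the aggregated results.
    return f"Each of the {len(kids)} kids gets {share:g} muffins."

print(fair_share_answer("party.jpg"))  # -> Each of the 2 kids gets 3 muffins.
```

The point of the sketch is the separation: neither step needs a model that is good at both seeing and reasoning, and each specialized piece can be swapped out independently.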

So that is the basic idea behind a lot of these techniques, which are really cool. They're not necessarily novel anymore, but I think that we're going to see so much more in that regard as these models get better and better and as the speed and latency improve. So oftentimes one of the big problems with these approaches is that they take a while because you need to step back and decide what actions you want to take and how you want to split things up, and then aggregate all the results and respond.

And there are also a lot of steps, which can compound errors, compound latency and cost, and also compound potential prompt issues, like when the model decides not to respond and instead says, "This is actually out of my jurisdiction. I don't want to do this." You can get more and more of those as the number of steps increases. But we're seeing these types of systems get more robust, faster and cheaper, and I think these approaches are going to be really pervasive over the next couple of months.

Sage Elliott:

It's amazing. I've seen some examples of this with startups around robotic navigation: grounding, knowing where objects are, what's in the image, and how to move around things and get to destinations. It's so cool seeing some of the stuff people are doing with that technique, like putting it into a physical object that can better navigate the real world in really, really complex situations.

How is Voxel51 using LLMs and computer vision?

Sage Elliott:

You've been doing some awesome work with LLMs and computer vision at Voxel51. Can you tell us a little bit more about what you've been working on?

Jacob Marks:

A little bit of context: Voxel51 is a data-centric AI company, so our goal is to bring transparency and clarity to the world's data. We want to be the place where that work happens — really give people the tools to better understand their data and create high-quality datasets so they can train better models.

We are the lead maintainers and developers of the open-source project FiftyOne. A couple of months ago, at our last fireside chat, we talked about VoxelGPT, which uses large language models essentially as a query engine for the FiftyOne query language.

So we took the LLM's capabilities and fed in all the particular knowledge of the schema of FiftyOne datasets (the way people actually structure their computer vision data in our toolset), then used the natural language query plus that dataset knowledge to turn the request into working code that filters the dataset.
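A minimal sketch of that pattern (not VoxelGPT's actual implementation): pack the dataset schema and the user's request into a prompt, and have an LLM return query code. The LLM call is stubbed out here, and the FiftyOne-style expression it returns is purely illustrative.

```python
# Toy sketch of using an LLM as a query engine for a dataset query language.
# build_prompt() shows the general shape; stub_llm() stands in for a real
# model call, and the expression it returns is illustrative, not generated.

def build_prompt(schema, nl_query):
    return (
        "Translate the request into a dataset query expression.\n"
        f"Dataset schema: {schema}\n"
        f"Request: {nl_query}\n"
        "Query:"
    )

def stub_llm(prompt):
    # Stand-in for a real LLM call (e.g. an API request).
    return 'dataset.match(F("ground_truth.label") == "dog")'

def translate(schema, nl_query, llm=stub_llm):
    return llm(build_prompt(schema, nl_query))

schema = {"ground_truth.label": "string", "uniqueness": "float"}
print(translate(schema, "show me all the dog samples"))
```

The key design point is that the schema travels with every request, so the model writes queries against the fields the user's dataset actually has.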

You can try FiftyOne out for free online, at Voxel51. It's going strong! 

I'm currently trying to use language models to improve the state of the art for computer vision tasks. Earlier, I mentioned OtterHD. This model is a fine-tuned version of Adept's Fuyu-8B. The main idea is that it doesn't actually have an explicit vision encoder in the model; it just passes in all the tokens for the image. And because of the way it does this, you have flexibility in the number of tokens representing the image, and image and text tokens are treated in a somewhat similar way.

So that means control and flexibility over the resolution of images, and they're all treated similarly so that you can get to very, very high-resolution images and get a super good understanding of them. 

People have seen that LLMs can improve the baseline on a lot of these computer vision tasks. And the same is true for things like visual question answering, which includes more multimodal tasks. People recently have gotten really creative with the way they approach retrieval for text. Retrieval-augmented generation is huge right now, and it forms the backbone of long-term memory for a lot of these LLM applications.

Now people are even talking about multimodal RAG (retrieval augmented generation). You take your vector database, which contains encoded vectors representing the documents (or chunks of documents) in your database. You take the user's query, maybe process it in some way, and compare it to the embeddings in your database to determine which relevant documents or chunks to feed into the model as context.

And that context is used to help the model generate some text at the end of the day. But retrieval on its own doesn't always improve results, so people use a technique called re-ranking: they take the candidate results from the initial retrieval stage and then use a cross-encoder. Most of the time, the cross-encoder takes the embedding of the query and the embedding of the document, or the raw text of the query and the document, and compares them to each other. That turns out to be more computationally expensive, but also more exact, than the embedding math alone, like taking the dot product or cosine similarity of the embeddings themselves.

So the gist is that people have figured out how to do really good re-ranking to improve the quality of these retrieval techniques for text. But there hasn't been as much work historically on re-ranking for image-based systems. The idea I'm pursuing right now is using these multimodal, or vision-language, models essentially as ranking engines for image-based retrieval tasks.

There is, of course, the possibility of using this in multimodal RAG. But even just for general image retrieval, state-of-the-art image retrieval models and techniques can be the starting point to develop your set of candidates. Once you have those candidates, you pass the images into your multimodal model and do some re-ranking on them to refine the results.
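The retrieve-then-re-rank pattern can be sketched end to end. Everything here is a toy stand-in: the "bi-encoder" is a bag-of-letters vector and the "cross-encoder" just counts shared words, but the shape of the pipeline (cheap similarity to pick candidates, then an expensive pairwise scorer on the short list) is the same one used with real embedding models and cross-encoder transformers.

```python
import math

# Toy sketch of two-stage retrieval: a cheap "bi-encoder" stage narrows the
# pool, and a more expensive pairwise "cross-encoder" re-ranks the survivors.

def embed(text):
    # Toy embedding: a 26-dim bag-of-letters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cross_score(query, doc):
    # Toy cross-encoder: scores the (query, document) pair jointly.
    # Run only on the short candidate list because it is per-pair work.
    return sum(1 for w in set(query.lower().split()) if w in doc.lower().split())

docs = [
    "how to bake blueberry muffins",
    "used sports car listings and prices",
    "segmentation of medical images",
    "muffin tin sizes and bake times",
]
query = "bake muffins"

# Stage 1: cheap embedding similarity selects the top-2 candidates.
q = embed(query)
candidates = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:2]
# Stage 2: the expensive cross-encoder re-ranks only those candidates.
reranked = sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)
print(reranked[0])  # -> how to bake blueberry muffins
```

In this toy run, the first stage actually ranks the muffin-tin document highest, and the re-ranker flips the order in favor of the genuinely relevant one, which is exactly the kind of correction re-ranking is meant to provide.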

What is Multimodal RAG (Retrieval Augmented Generation)?

Sage Elliott:

That's amazing stuff. Could you touch on what Multimodal RAG is?

Jacob Marks:

Absolutely. Multimodal RAG is the idea that you may not want to pass in just language or text data. In the way humans look at the world, when we read textbooks or books or watch TV, we're seeing text and images together; oftentimes that text-and-image combination (or whatever other information we're consuming) is specifically curated to be combined in that way. These models are very high performing and have a really good generalized ability to understand the world, and we believe that being able to feed in relevant, curated data, not just text but also images and potentially other modalities, would help them actually generate the best results.

If you are able to feed in relevant images in addition to relevant text, doing the retrieval over a database of not just text but also images, the models are able to infer from all of those things. Sometimes the images are connected to the text, and sometimes you're interleaving the most relevant images into the text. All the information contained in the model's parameters, plus the context you feed in and the question you're asking, determine what a good result would be. One example of how this could be useful is retrieval-augmented image captioning. This is a super-basic example, but something that I think gets the point across pretty well.

If you give one of these models an image of a car and you ask it to caption that, it's probably going to give you a fairly basic caption. It might give you a really, really in-depth description, but it's not going to know where to take it per se. It's not going to know if you want to talk about the color of the car, the price of the car, what else is going on in the room with the car or on the street with the car.

It's just not going to know all these details. But if you instead feed in a bunch of examples of image-caption pairs that are specifically cars (say, a database of car listings and the captions for the sports cars listed on your website), it can do a much better job of knowing what the caption for the image should be. And what you're doing here with caption pairs could be any combination of documents from multiple modalities.
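Here is a toy sketch of that retrieval-augmented captioning setup. The retrieval step is faked with a tag match; a real system would compare image embeddings (for example from CLIP) against a vector index of listing photos, and the assembled prompt would go to a multimodal model along with the actual image.

```python
# Sketch of retrieval-augmented image captioning: fetch similar
# image-caption pairs and prepend them as in-context examples.

def retrieve_examples(image_tag, database, k=2):
    # Toy retrieval: up to k caption pairs "similar" to the query image.
    return [row for row in database if row["tag"] == image_tag][:k]

def build_caption_prompt(image_tag, database):
    lines = ["Caption the last image in the style of these examples:"]
    for row in retrieve_examples(image_tag, database):
        lines.append(f"- image <{row['tag']}>: {row['caption']}")
    lines.append(f"- image <{image_tag}>:")
    return "\n".join(lines)

listings = [
    {"tag": "car", "caption": "2019 coupe, 24k miles, one owner"},
    {"tag": "car", "caption": "Classic roadster, garage kept"},
    {"tag": "bike", "caption": "Carbon road bike, size 54"},
]

print(build_caption_prompt("car", listings))
```

The retrieved car captions steer the model toward listing-style language (mileage, condition, price) that a generic "describe this image" prompt would never elicit.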

Development stack for combining LLMs and computer vision

Sage Elliott:

I know we're getting close to our time here. Could you talk a little bit about what your development stack looks like right now for combining LLMs and computer vision?

Jacob Marks:

I don't know if it's scalable yet, and I also don't think it's the most state-of-the-art setup anybody has, so I won't bore the audience with the details! What I will say is that my approach to development, with LLMs or otherwise, is data-centric. I know there are questions about the emergent properties of LLMs and the fact that when you get these huge numbers of parameters in a model, you see qualitatively different behavior than you do with smaller models.

That may be true in some cases. I think that the jury's still out on where that holds, where it doesn't hold, and there's a lot of work that needs to be done to flesh out those details. But the important thing for me is that there are still a lot of scenarios where digging into the data and focusing on small problems — starting really small and understanding how things generalize — is much more important than just throwing huge models at problems. And so my approach is very much focused on the data.

What is Voxel51?

Sage Elliott:

Can you tell us a little bit more about Voxel51? Anything else you want to shout out or encourage people to check out? 

Jacob Marks:

I'll just leave you with one additional thought on Voxel51 and FiftyOne. For a couple of years, we've had this toolset, FiftyOne, an open-source toolkit for curation and visualization of data in different modalities. And it's really flexible and versatile: you can use it through the Python SDK, which gives programmatic access to your data, as well as via a UI.

But one thing we heard from a lot of our users is that there are so many different directions the field is going, and so much flexibility they need, that instead of building all the different features everybody was interested in for the different tasks and data types, we decided to focus on building out a plugin system that makes FiftyOne customizable.

So we have a really, really awesome plugin system that has been growing, and the ecosystem has been just exploding over the last couple of months. And regardless of the specific task or workflow or use case you're working on or modality, FiftyOne is now able to be extended and applied in just a few lines of code.

And we even have a plugin-builder plugin in the app, so it's just super flexible. I highly encourage you to check it out. Even if the things you see on the website for FiftyOne aren't exactly what you need, in just a few lines of code you can make it happen.

Sage Elliott:

Do you have a plugin-builder GPT yet?

Jacob Marks:

Not yet. That's next!

What’s the next exciting thing in AI?

Sage Elliott:

What are you looking forward to in the AI field?

Jacob Marks:

We've seen a lot of focus recently on foundation models and making models bigger and better and all that. But we've also seen work on efficiency, whether you're talking about CLIP models or fine-tuning huge LLMs.

A lot of work has been done in quantization, and it's speeding up the fine-tuning process. And then there's a lot of work on the usability side, making it easier to actually gather the data and train a model without really needing to write too much code. Putting all this together, we're going to see a lot more open access to customized intelligence. Hopefully it's going to be real time at some point soon, and you'll be able to do it basically without writing much code. I'm super excited for people to have this democratized access to customized intelligence.

Is the computer vision hype already here?

Sage Elliott:

One more question came in from the audience: Do you think the computer vision hype is already happening with multimodal LLMs right now? I think they meant kind of like the big LLM hype that has been happening. And do you think there's going to be a larger computer vision hype happening soon as well?

Jacob Marks:

I don't think we have nearly hit our ChatGPT moment for computer vision yet. There's a lot of work and a lot of activity in multimodal LLMs, or vision-language models, these days, but it's still riding the coattails of the LLM wave.

We'll have our ChatGPT moment once there's some advance, or things become efficient enough, to make it accessible to everybody. For computer vision, when we find that killer use case, the one killer app, we're going to see another explosion.

Sage Elliott:

Yeah, I agree. I think combining all of these things together is going to drive something really cool, and I wouldn't be surprised if you're the one who builds it, Jacob.

Thank you so much for coming on and being part of our first Union fireside chat. I'm really excited that you were our first guest. It's always a pleasure talking with you!

Definitely go check out the links in the description, or find Jacob Marks on LinkedIn and check out the Voxel51 community.

Watch or listen to the interview on YouTube.
