Episodes

  • Improving Agent Design, JPEG-LM's Visual Breakthrough, TurboEdit's Real-Time Image Edits, Video Segmentation Advances, LLMs Learning Like Humans, RL Benchmarks
    Aug 21 2024
    xGen-MM (BLIP-3): A Family of Open Large Multimodal Models; JPEG-LM: LLMs as Image Generators with Canonical Codec Representations; Automated Design of Agentic Systems; TurboEdit: Instant text-based image editing; Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning; Fine-tuning Large Language Models with Human-inspired Learning Strategies in Medical Question Answering; D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning
    16 mins
  • Science & Clinical LLMs Leaps, Enhancing Small Model Reasoning, New Frontiers in Controlled Media Generation
    Aug 16 2024
    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery; Med42-v2: A Suite of Clinical LLMs; Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers; ControlNeXt: Powerful and Efficient Control for Image and Video Generation; CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer; FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework; VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
    14 mins
  • Multimodal Benchmarks, Visual Task Transfer, and 3D Object Generation
    Aug 8 2024
    MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models; LLaVA-OneVision: Easy Visual Task Transfer; An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion; MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine; IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts; Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters; Diffusion Models as Data Mining Tools
    14 mins
  • Image and Video Segmentation with SAM 2, Gemma 2 for Efficient Language Models, Boosting Small Models with Contrastive Fine-Tuning, and MM-Vet v2 Challenges Large Multimodal Models
    Aug 5 2024
    SAM 2: Segment Anything in Images and Videos; Gemma 2: Improving Open Language Models at a Practical Size; Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model; Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning; OmniParser for Pure Vision Based GUI Agent; SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement; MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
    14 mins
  • Text-Guided Image Inpainting, AMEX for Mobile GUI Agents, AgentScope's Multi-Agent Simulation
    Jul 30 2024
    Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model; LAMBDA: A Large Model Based Data Agent; AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents; BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation; Very Large-Scale Multi-Agent Simulation in AgentScope; Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?; Course-Correction: Safety Alignment Using Synthetic Preferences
    14 mins
  • OpenDevin & AI Software Development, Enhancing Visual Language Models, DDK: Refining Large Language Model Efficiency through Domain Knowledge
    Jul 25 2024
    OpenDevin: An Open Platform for AI Software Developers as Generalist Agents; VILA^2: VILA Augmented VILA; HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation; PERSONA: A Reproducible Testbed for Pluralistic Alignment; SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency; Scalify: scale propagation for efficient low-precision LLM training; DDK: Distilling Domain Knowledge for Efficient Large Language Models
    14 mins
  • Vocabulary Expansion for Large Models, Big Data Enhancing LMs, 4D Reconstruction Progress, AI Cityscape Generation, DPO Policy Analysis, Expanding Code Models, Multimodal LM Trust Evaluation
    Jul 22 2024
    Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies; Scaling Retrieval-Based Language Models with a Trillion-Token Datastore; Shape of Motion: 4D Reconstruction from a Single Video; Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion; Understanding Reference Policies in Direct Preference Optimization; Scaling Granite Code Models to 128K Context; Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study
    15 mins
  • Qwen2 Language Model, Mitigating Privacy Risks in LLMs, Exploring Non-Determinism, Increased Efficiency with Q-Sparse, GRUtopia for Embodied AI
    Jul 17 2024
    Qwen2 Technical Report; Learning to Refuse: Towards Mitigating Privacy Risks in LLMs; The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism; Q-Sparse: All Large Language Models can be Fully Sparsely-Activated; GRUtopia: Dream General Robots in a City at Scale
    11 mins