
Apple Vision vs MediaPipe: Which is Better for Yoga Pose Detection?

by Brian

A technical comparison from building Eight Angle, an iOS yoga app.


Apple Vision (left) failed to detect this Downward Dog pose. MediaPipe (right) found all 19 joints. Green = high confidence, yellow = medium, red = low.


When building Eight Angle, we needed reliable pose detection to identify yoga poses in real time. We started with Apple’s built-in Vision framework but soon noticed a problem: the detector failed to find a pose in many of our training images, especially when bodies were inverted or partially occluded. So we tested Google’s MediaPipe as an alternative.

The result? MediaPipe found a pose in 97.4% of our images versus 71.2% for Vision, a gap of 26 percentage points, and classification accuracy rose by about 4 points (93.4% vs. 89.6%). Here’s what we found.

Background: How Pose Detection Works

Pose detection is the computer vision task of finding a person’s body position in an image. The detector outputs a “skeleton” — a set of joint points (ankles, knees, hips, shoulders, wrists, etc.) with X/Y coordinates and confidence scores.

Apple Vision outputs 19 joints. MediaPipe actually provides 33 landmarks, but we use the 19 that map to Vision’s joints to keep the comparison fair. For each joint, you get:

  • Position: X and Y coordinates (normalized 0-1)
  • Confidence: How certain the detector is (0-1)
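As a rough illustration of that output, here is one way to represent a skeleton in Python. The `Joint` and `Skeleton` names, the joint names, and the sample values are all made up for this sketch; they are not identifiers from Vision or MediaPipe, and this is not the Eight Angle code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Joint:
    """One detected joint: normalized position plus detector confidence."""
    x: float           # horizontal position, normalized to 0-1
    y: float           # vertical position, normalized to 0-1
    confidence: float  # how certain the detector is, 0-1


# A skeleton maps a joint name to its detection.
# Joint names here are illustrative, not either framework's identifiers.
Skeleton = Dict[str, Joint]


def mean_confidence(skeleton: Skeleton) -> float:
    """Average confidence across all joints (0.0 for an empty skeleton)."""
    if not skeleton:
        return 0.0
    return sum(j.confidence for j in skeleton.values()) / len(skeleton)


# Made-up sample values, for illustration only.
example: Skeleton = {
    "left_knee": Joint(x=0.42, y=0.71, confidence=0.9),
    "right_knee": Joint(x=0.58, y=0.70, confidence=0.5),
}
```

Keeping both detectors' output in one shared structure like this is what makes it possible to run the same downstream pipeline on each.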

Yoga presents unique challenges for pose detectors:

  • Inversions: Headstands and handstands flip the expected body orientation
  • Occlusions: Arms behind the body, legs crossed over each other
  • Unusual shapes: Poses like Wheel or Eight Angle don’t look like typical standing poses

These edge cases matter because a yoga app needs to work across the full range of poses, not just standing positions.

The Comparison Setup

We evaluated both detectors using the same conditions:

  • 1,603 training images across 21 yoga pose classes
  • Same feature extraction: 31 geometric features (joint angles, distances, symmetry measures)
  • Same classifier: Random Forest with 5-fold cross-validation
  • Same hardware: All processing on Apple Silicon

The feature pipeline transforms raw joint coordinates into meaningful measurements like “hip angle” or “knee symmetry” — this matters more for classification than raw positions.
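To make "joint angle" concrete, here is a minimal sketch of how an angle feature could be computed from three joint positions. The function name and the sample coordinates are assumptions for illustration; our actual 31 features are not reproduced here.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at vertex b, between the segments b->a and b->c.

    Assumes x and y share the same scale; normalized coordinates from a
    non-square image would need rescaling first.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("coincident points: angle is undefined")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against floating-point drift just outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical hip angle from shoulder, hip, and knee positions:
# the shoulder is directly above the hip and the knee is level with it,
# so this is a right angle.
hip_angle = joint_angle((0.50, 0.20), (0.50, 0.50), (0.80, 0.50))
```

An angle like this stays the same wherever the person stands in the frame, which is one reason derived features tend to classify better than raw coordinates.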

Key Finding #1: Detection Rate

The biggest difference was how many poses each detector could find.

| Detector | Detected | Total | Detection Rate |
|---|---|---|---|
| Apple Vision | 1,142 | 1,603 | 71.2% |
| MediaPipe | 1,562 | 1,603 | 97.4% |

MediaPipe detected 420 more poses — a 36.8% increase in usable training data from the same image set.
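The two percentages measure different things, which is easy to mix up: the detection rates differ by about 26 percentage points, while the 36.8% figure is the relative growth in usable samples. A quick check of the arithmetic, using the counts reported above (the variable names are just for this sketch):

```python
TOTAL_IMAGES = 1603
VISION_DETECTED = 1142
MEDIAPIPE_DETECTED = 1562


def detection_rate(detected: int, total: int) -> float:
    """Percentage of images in which the detector found a pose."""
    return 100.0 * detected / total


vision_rate = detection_rate(VISION_DETECTED, TOTAL_IMAGES)
mediapipe_rate = detection_rate(MEDIAPIPE_DETECTED, TOTAL_IMAGES)

# Absolute gap between the two rates, in percentage points.
gap_points = mediapipe_rate - vision_rate

# Extra usable samples, and their size relative to Vision's sample count.
extra_samples = MEDIAPIPE_DETECTED - VISION_DETECTED
relative_increase = 100.0 * extra_samples / VISION_DETECTED

print(f"Vision: {vision_rate:.1f}%  MediaPipe: {mediapipe_rate:.1f}%")
print(f"Gap: {gap_points:.1f} points, +{extra_samples} samples "
      f"({relative_increase:.1f}% more)")
```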

The gap was largest for challenging poses:

| Pose | Vision Samples | MediaPipe Samples | Gained |
|---|---|---|---|
| forward_fold | 51 | 123 | +72 |
| crow | 47 | 117 | +70 |
| downward_dog | 63 | 130 | +67 |
| bridge | 26 | 91 | +65 |
| handstand | 37 | 91 | +54 |

Vision struggled most with poses where the body is folded or inverted. Forward Fold, for example, has the head below the hips with legs potentially occluding the torso: Vision yielded 51 usable samples for this pose, fewer than half of MediaPipe’s 123.

Crow pose: Vision (left) found only 10 of 19 joints, mostly with low confidence. MediaPipe (right) detected all 19 joints clearly.

Key Finding #2: Classification Accuracy

More detected poses also meant better classification:

| Detector | Accuracy | Std Dev |
|---|---|---|
| Apple Vision | 89.6% | ±1.8% |
| MediaPipe | 93.4% | ±1.2% |

MediaPipe’s accuracy was both higher and more consistent (lower standard deviation).
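For readers unfamiliar with how an accuracy and a standard deviation come out of 5-fold cross-validation, here is a bare-bones sketch using only the standard library. It shows the fold bookkeeping and the summary step, not the Random Forest itself. The function names are hypothetical, and whether the ± values above are population or sample standard deviations is not something this sketch can settle; `pstdev` is an assumption.

```python
from statistics import mean, pstdev
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def summarize(fold_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of per-fold accuracies."""
    return mean(fold_accuracies), pstdev(fold_accuracies)
```

Each fold trains on four fifths of the samples and tests on the remaining fifth; the reported accuracy is the mean across folds, and the spread shows how much the result depends on which samples land in the test fold. A real pipeline would shuffle the samples first and usually stratify by pose class, which this sketch leaves out.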

Per-Class Winners

The biggest improvements came from poses that Vision had trouble detecting:

| Pose | Vision | MediaPipe | Change |
|---|---|---|---|
| low_lunge | 66.7% | 92.1% | +25.4% |
| splits | 66.7% | 80.0% | +13.3% |
| forward_fold | 82.4% | 95.1% | +12.8% |
| eight_angle | 55.9% | 68.1% | +12.2% |
| downward_dog | 90.5% | 100.0% | +9.5% |

Low Lunge improved dramatically because MediaPipe could detect the back leg even when it’s behind the front leg. With Vision, those samples had missing joint data that hurt classification.

The Headstand Mystery

Headstand: Both detectors found the pose, but notice MediaPipe’s higher confidence scores (green) vs Vision’s lower confidence (yellow/red).

One pose bucked the trend: Headstand went from 100% accuracy with Vision down to 88% with MediaPipe.

This seemed counterintuitive — why would better detection lead to worse classification?

The root cause: with only 41 training samples, the class was vulnerable to noise. When we investigated the 5 misclassified images, we found two patterns:

  1. Straight-arm variants (3 images): Classified as Handstand because the arms were extended, not bent
  2. Prep poses with tucked knees (2 images): Classified as Crow because the body position was similar

These weren’t detector errors — they were training data edge cases. MediaPipe detected more variation in the training set, exposing poses that were borderline between classes.

The lesson: more data exposes more edge cases. Vision’s lower detection rate masked these ambiguous samples by simply not including them.

Why the Difference?

MediaPipe’s advantage likely comes from several factors:

Better Occlusion Handling

MediaPipe had dramatically fewer missing joints. For example, knee_symmetry (which requires both knees) was missing in 17% of Vision samples but only 3.5% of MediaPipe samples.
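A feature that depends on two joints is only available when both are detected, which is why missing joints show up as missing features. The sketch below illustrates the idea with a made-up definition of `knee_symmetry` (difference in knee heights); our real definition may differ, and the joint names are placeholders.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Joints = Dict[str, Point]  # only joints the detector actually found


def knee_symmetry(joints: Joints) -> Optional[float]:
    """Hypothetical symmetry measure: absolute difference in knee heights.

    Returns None when either knee is missing, since the feature
    cannot be computed from one knee alone.
    """
    left = joints.get("left_knee")
    right = joints.get("right_knee")
    if left is None or right is None:
        return None
    return abs(left[1] - right[1])


def missing_rate(samples: List[Joints]) -> float:
    """Percentage of samples for which knee_symmetry is unavailable."""
    if not samples:
        return 0.0
    missing = sum(1 for s in samples if knee_symmetry(s) is None)
    return 100.0 * missing / len(samples)
```

Running `missing_rate` over each detector's samples is the kind of check that produces a figure like the 17% vs. 3.5% comparison above.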

Lower Effective Threshold

We found that MediaPipe’s joint positions are accurate even at low confidence scores. Setting the threshold to 0.001 (essentially accepting all detections) gave the best results:

| Threshold | Accuracy |
|---|---|
| 0.10 | 92.6% |
| 0.05 | 92.4% |
| 0.001 | 93.4% |

This is different from Vision, where low-confidence joints are often inaccurate. MediaPipe seems to output confidence scores more conservatively.
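In code, the threshold is just a filter applied to each joint before feature extraction. A minimal sketch, with hypothetical names and made-up sample values:

```python
from typing import Dict, Tuple

Detection = Tuple[float, float, float]  # (x, y, confidence)


def filter_by_confidence(joints: Dict[str, Detection],
                         threshold: float) -> Dict[str, Detection]:
    """Keep only joints whose confidence is at or above the threshold."""
    return {name: d for name, d in joints.items() if d[2] >= threshold}


# Made-up detections, for illustration only.
sample = {
    "left_wrist": (0.31, 0.80, 0.90),
    "left_ankle": (0.45, 0.95, 0.07),
    "right_ankle": (0.55, 0.95, 0.002),
}

kept_strict = filter_by_confidence(sample, 0.10)   # keeps left_wrist only
kept_loose = filter_by_confidence(sample, 0.001)   # keeps all three joints
```

Lowering the threshold trades fewer missing joints for a higher risk of including badly placed ones. With MediaPipe that trade worked in our favor; the same setting would not necessarily help with a detector whose low-confidence positions are less accurate.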

Training Data

MediaPipe also appears to have been trained on a larger, more diverse pose dataset. Google’s research papers mention handling for occlusion, unusual orientations, and partial visibility, which are exactly the scenarios that matter for yoga.

What We Chose

For Eight Angle, we switched our training pipeline to MediaPipe. The deciding factors:

  • Detection rate mattered most. With limited training images, getting 420 more usable samples was significant. More data means better generalization.
  • Yoga poses are inherently challenging. Inversions, arm balances, and deep folds are core to the practice, and Vision missed nearly 30% of our images, most often in exactly those poses. We couldn’t ignore that.
  • The accuracy gain was a bonus. We expected more data to help, but the roughly 4-point accuracy improvement (89.6% to 93.4%) suggests MediaPipe’s joint positions are genuinely more reliable for our use case.

Vision isn’t a bad choice for simpler applications — if you’re just detecting someone standing or walking, it works fine. But for yoga, where the whole point is putting your body in unusual positions, MediaPipe handles the edge cases that matter.


Interested in seeing how Eight Angle helps you level up your yoga practice? Join the waitlist to get early access.


Summary Statistics

| Metric | Apple Vision | MediaPipe | Winner |
|---|---|---|---|
| Detection Rate | 71.2% | 97.4% | MediaPipe (+26.2 points) |
| Classification Accuracy | 89.6% | 93.4% | MediaPipe (+3.8 points) |
| Accuracy Std Dev | ±1.8% | ±1.2% | MediaPipe (more consistent) |
| Usable Samples | 1,142 | 1,562 | MediaPipe (+420) |