HuggingFace Inference API Integration
Beginner · v1.0.0
Integrate HuggingFace's Inference API into your applications — serverless model inference, streaming responses, and dedicated endpoints without managing infrastructure.
Overview
The HuggingFace Inference API provides serverless access to thousands of models without deploying infrastructure. Use it for text generation, embeddings, classification, and image tasks with simple HTTP requests.
Why This Matters
- Zero infrastructure — no GPU servers to manage
- Model variety — access 200k+ models via API
- Scalability — automatic scaling from hobby to production
- Cost efficiency — pay per request, no idle GPU costs
How It Works
Step 1: Get an API Token
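Create an access token at https://huggingface.co/settings/tokens (a read-scoped token is enough for inference), then expose it to your application via an environment variable rather than hard-coding it. A minimal sketch, assuming the conventional `HF_TOKEN` variable name:

```typescript
// Read the HuggingFace access token from the environment.
// HF_TOKEN is a conventional name, not required by the API itself.
function getToken(): string {
  const token = process.env.HF_TOKEN;
  if (!token) {
    throw new Error("HF_TOKEN is not set; export it before running.");
  }
  return token;
}
```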
Step 2: Basic Inference
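A serverless inference call is a single authenticated POST to `https://api-inference.huggingface.co/models/<model-id>`. A sketch using the built-in `fetch` (Node 18+); the model name and parameter values are illustrative:

```typescript
const HF_API = "https://api-inference.huggingface.co/models";

// Build the JSON request body. Kept as a pure function so the payload
// shape is easy to inspect and test.
function buildBody(prompt: string, maxNewTokens: number): string {
  return JSON.stringify({
    inputs: prompt,
    parameters: { max_new_tokens: maxNewTokens },
  });
}

async function generate(model: string, prompt: string): Promise<unknown> {
  const res = await fetch(`${HF_API}/${model}`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.HF_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: buildBody(prompt, 100),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  // Text-generation models typically return [{ generated_text: "..." }]
  return res.json();
}
```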
Step 3: TypeScript Client
Step 4: Dedicated Endpoints (Production)
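A dedicated Inference Endpoint accepts the same request shape as the serverless API, but at a fixed URL owned by your deployment, so switching is mostly a base-URL change. A sketch; the endpoint URL below is a placeholder, copy yours from the endpoint dashboard:

```typescript
// Placeholder URL: every dedicated endpoint gets its own fixed hostname.
const ENDPOINT_URL =
  "https://my-endpoint.us-east-1.aws.endpoints.huggingface.cloud";

async function callEndpoint(prompt: string): Promise<unknown> {
  const res = await fetch(ENDPOINT_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.HF_TOKEN}`,
      "Content-Type": "application/json",
    },
    // Same payload shape as the serverless API.
    body: JSON.stringify({
      inputs: prompt,
      parameters: { max_new_tokens: 100 },
    }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}
```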
Best Practices
- Use the @huggingface/inference npm package for TypeScript projects
- Enable streaming for interactive applications
- Set max_new_tokens to prevent runaway generation
- Use dedicated endpoints for production workloads (SLA guarantees)
- Cache responses for identical queries to reduce costs
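The last point can be sketched as a small in-memory cache keyed on the full request (model, prompt, and parameters); production code would add a TTL and bound the cache size:

```typescript
// In-memory response cache: identical queries skip the second API call.
const cache = new Map<string, unknown>();

async function cached<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
  if (cache.has(key)) return cache.get(key) as T;
  const value = await fetcher();
  cache.set(key, value);
  return value;
}

// Usage (generate is whatever function performs the actual API call):
// const key = JSON.stringify({ model, prompt, max_new_tokens: 100 });
// const result = await cached(key, () => generate(model, prompt));
```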
Common Mistakes
- Not setting max_new_tokens (the model generates until it hits the context limit)
- Using the serverless API for production traffic (rate limited, cold starts)
- Sending large payloads without checking the model's max input length
- Not handling 503 (model loading) responses with retry logic
- Exposing HF_TOKEN in client-side code
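On the serverless API, a 503 usually means the model is still loading after a cold start, so the right reaction is to wait and retry rather than fail. One way to sketch that, with exponential backoff and a bounded attempt count:

```typescript
// Retry a request while the serverless API returns 503 (model loading).
async function withRetry(
  fn: () => Promise<Response>,
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fn();
    if (res.status !== 503) return res;
    // Model still loading: back off exponentially, then try again.
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  throw new Error("model still loading after retries");
}
```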