{ parse: url -> structured_data }
const data = await scraper.extract(url);

Sentient Scraper API

Turn Web Pages into Clean JSON Data

Our intelligent API transforms unstructured web content into clean, consistent JSON. Perfect for data pipelines, content APIs, and developer workflows.

RESTful API
AI-Enhanced Extraction
Schema Normalization
sentient-api.mediathrive.com

Enter the URL of any specific page (article, blog post, product, etc.) to see it transformed into structured JSON data.

api-example.js
// Submit a URL for scraping
const response = await fetch('https://sentient-api.mediathrive.com/api/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/article',
    fetchOptions: {
      forceStrategy: 'http',
      timeout: 30000,
      additionalWaitMs: 1000
    }
  })
});

// Get job ID from response
const { jobId } = await response.json();

// Retrieve the scrape results
const result = await fetch(`https://sentient-api.mediathrive.com/api/scrape/${jobId}`);
const data = await result.json();

// Access the extracted schema data
console.log(`Detected schema: ${data.detected_schema_type}`);
console.log(data.enriched_schema);
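The example above requests the result right after submitting the job. If the job is still running at that moment, the result may not be ready yet. The sketch below polls until the job settles. It assumes that `completed` and `failed` are terminal values of the `status` field (both appear elsewhere on this page, but the full list of statuses is not given here). The names `waitForJob`, `intervalMs`, `maxAttempts`, and `fetchImpl` are hypothetical and used only for illustration.

```javascript
// Hypothetical helper: poll GET /api/scrape/:id until the job settles.
// Assumes "completed" and "failed" are terminal statuses; adjust if the
// API documents others.
async function waitForJob(jobId, options = {}) {
  const {
    baseUrl = 'https://sentient-api.mediathrive.com',
    intervalMs = 1000,
    maxAttempts = 30,
    fetchImpl = fetch, // injectable so the helper can be exercised offline
  } = options;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetchImpl(`${baseUrl}/api/scrape/${jobId}`);
    if (!response.ok) {
      throw new Error(`Status request failed with HTTP ${response.status}`);
    }
    const job = await response.json();
    if (job.status === 'completed') return job;
    if (job.status === 'failed') throw new Error(`Scrape job ${jobId} failed`);
    // Not finished yet: wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Scrape job ${jobId} did not finish after ${maxAttempts} attempts`);
}
```

With this helper, the last two fetch calls in the example could become `const data = await waitForJob(jobId);`.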
99.9% Uptime SLA
15+ Schema Types Supported
Comprehensive Documentation
Developer-First Support

Developer-First API Features

Built for engineers who need reliable data extraction at scale

Schema.org Compliant Extraction

Our AI automatically detects content types and extracts them into schema.org compatible formats with support for 15+ primary schema types.

// GET /api/scrape/:id response
{
  "id": "12345678-1234-1234-1234-123456789012",
  "url": "https://example.com/article",
  "status": "completed",
  "detected_schema_type": "Article",
  "detected_schema_confidence": 0.95,
  "enriched_schema": {
    "@type": "Article",
    "headline": "Example Article",
    "author": "John Doe",
    "datePublished": "2023-01-01T00:00:00Z",
    "articleBody": "This is an example article..."
  }
}
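Because the response carries a confidence score, a pipeline can decide whether to trust the detected type before storing the data. The sketch below shows one way to do that. The helper name `acceptResult` and the 0.8 default threshold are assumptions chosen for illustration, not documented recommendations.

```javascript
// Hypothetical helper: accept a result only when the job completed and the
// detected schema type is trusted. The 0.8 default threshold is an
// illustrative assumption.
function acceptResult(result, minConfidence = 0.8) {
  if (result.status !== 'completed') {
    return { accepted: false, reason: `status is ${result.status}` };
  }
  const confidence = result.detected_schema_confidence;
  if (typeof confidence !== 'number' || confidence < minConfidence) {
    return { accepted: false, reason: 'low or missing confidence' };
  }
  return {
    accepted: true,
    type: result.detected_schema_type,
    data: result.enriched_schema,
  };
}
```

Results that are not accepted could be routed to manual review instead of being discarded.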

RESTful API Endpoints

Complete RESTful API with endpoints for scrape job management, rule configuration, and application status monitoring.

// Submit a scrape job
POST /api/scrape
{
  "url": "https://example.com/page",
  "fetchOptions": {
    "forceStrategy": "http",
    "timeout": 30000,
    "additionalWaitMs": 1000,
    "headers": {
      "User-Agent": "Custom User Agent"
    }
  }
}

Intelligent Caching

Smart caching system automatically detects when previously scraped content is requested and returns cached results for improved performance.

// Cache hit response
{
  "message": "Scrape job submitted",
  "jobId": "12345678-1234-1234-1234-123456789012",
  "url": "https://example.com/page",
  "status": "completed",
  "cached": true,
  "schemaType": "Article"
}
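A client can use the `cached` and `status` fields to skip waiting when the result already exists. The sketch below reads only fields shown in the cache hit example above; the helper name `nextStep` and the returned `action` values are hypothetical.

```javascript
// Hypothetical helper: decide the next step from a POST /api/scrape response.
// Field names (jobId, status, cached) come from the cache hit example above.
function nextStep(submitResponse) {
  const { jobId, status, cached } = submitResponse;
  if (cached === true && status === 'completed') {
    // Cached result: the job is already complete, so fetch it right away.
    return { action: 'fetch', jobId };
  }
  // Fresh job: check GET /api/scrape/:id until it finishes.
  return { action: 'poll', jobId };
}
```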

Human-in-the-Loop Validation

Powerful HITL workflow for reviewing and approving AI-generated extraction rules, ensuring the highest data quality for your most critical sources.

// GET /api/hitl/rules/:id
{
  "ruleset": {
    "id": "12345678-1234-1234-1234-123456789012",
    "domain": "example.com",
    "schema_type": "Article",
    "rules": {
      "title": {
        "selector": "h1.title",
        "type": "text"
      },
      "author": {
        "selector": "span.author",
        "type": "text"
      }
    },
    "ai_confidence_score": 0.92
  },
  "sample": {
    "url": "https://example.com/sample-page",
    "extracted_schema": {
      "@type": "Article",
      "title": "Sample Article",
      "author": "Jane Smith"
    }
  }
}
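For a reviewer, it helps to see each rule next to the value it produced on the sample page. The sketch below builds such a checklist from the response shown above. The helper name `summarizeRuleset` and the 0.9 review threshold are assumptions for illustration; how approval is submitted is not shown on this page and is left out.

```javascript
// Hypothetical helper: turn a GET /api/hitl/rules/:id response into a short
// checklist for a human reviewer. The 0.9 review threshold is an assumption.
function summarizeRuleset(response, reviewThreshold = 0.9) {
  const { ruleset, sample } = response;
  const extracted = (sample && sample.extracted_schema) || {};
  const fields = Object.entries(ruleset.rules).map(([name, rule]) => ({
    field: name,
    selector: rule.selector,
    // Show what the rule produced on the sample page, or null if nothing.
    sampleValue: name in extracted ? extracted[name] : null,
  }));
  return {
    domain: ruleset.domain,
    schemaType: ruleset.schema_type,
    needsCloseReview: ruleset.ai_confidence_score < reviewThreshold,
    fields,
  };
}
```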

Advanced Fetching Strategies

Flexible options for content retrieval including HTTP-based and headless browser strategies, with customizable timeouts and request headers.

// POST /api/scrape with fetch options
{
  "url": "https://example.com/dynamic-page",
  "fetchOptions": {
    "forceStrategy": "playwright",
    "timeout": 60000,
    "additionalWaitMs": 2000,
    "headers": {
      "User-Agent": "Mozilla/5.0 ...",
      "Accept-Language": "en-US,en;q=0.9"
    }
  }
}
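The two request bodies on this page differ only in strategy, timeout, and wait time, so a small builder can keep them consistent. The sketch below reuses the values from those examples. The helper name `buildScrapeRequest` and its `dynamic` flag are hypothetical, and whether other `forceStrategy` values exist is not stated here.

```javascript
// Hypothetical helper: build a POST /api/scrape body. Only the two strategy
// values shown on this page ("http" and "playwright") are used here.
function buildScrapeRequest(url, { dynamic = false, headers = {} } = {}) {
  new URL(url); // throws a TypeError on a malformed URL
  const fetchOptions = dynamic
    ? { forceStrategy: 'playwright', timeout: 60000, additionalWaitMs: 2000 }
    : { forceStrategy: 'http', timeout: 30000, additionalWaitMs: 1000 };
  if (Object.keys(headers).length > 0) {
    fetchOptions.headers = { ...headers };
  }
  return { url, fetchOptions };
}
```

The returned object can be passed to `JSON.stringify` and sent as the request body.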

Comprehensive Metrics

Detailed metrics and status reporting for monitoring extraction performance, rule effectiveness, and application health.

// GET /api/status/metrics (Prometheus format)
# HELP sentient_jobs_total Total number of jobs processed
# TYPE sentient_jobs_total counter
sentient_jobs_total{status="completed",type="scrape"} 100
sentient_jobs_total{status="failed",type="scrape"} 5

# HELP sentient_schema_detection_confidence
# TYPE sentient_schema_detection_confidence gauge
sentient_schema_detection_confidence{schema="Article"} 0.92
sentient_schema_detection_confidence{schema="Product"} 0.89
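In production these metrics would normally be scraped by Prometheus itself, but a quick health check can read the text directly. The sketch below computes the scrape failure rate from the `sentient_jobs_total` counter shown above. The helper name `scrapeFailureRate` is hypothetical, and the label parsing is deliberately simple: it assumes label values contain no commas or equals signs.

```javascript
// Hypothetical helper: compute the scrape failure rate from the
// Prometheus text returned by GET /api/status/metrics.
function scrapeFailureRate(metricsText) {
  const counts = { completed: 0, failed: 0 };
  for (const line of metricsText.split('\n')) {
    // Matches e.g. sentient_jobs_total{status="failed",type="scrape"} 5
    const match = line.trim().match(/^sentient_jobs_total\{([^}]*)\}\s+(\S+)$/);
    if (!match) continue; // skips comments, blank lines, and other metrics
    const labels = Object.fromEntries(
      match[1]
        .split(',')
        .filter((pair) => pair.includes('='))
        .map((pair) => {
          const [key, value] = pair.split('=');
          return [key.trim(), value.trim().replace(/^"|"$/g, '')];
        })
    );
    if (labels.type === 'scrape' && labels.status in counts) {
      counts[labels.status] += Number(match[2]);
    }
  }
  const total = counts.completed + counts.failed;
  return total === 0 ? 0 : counts.failed / total;
}
```

For the sample output above, this is 5 failed jobs out of 105 total.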

Structured Data Extraction Use Cases

How our API transforms web content into structured Schema.org data for various applications

News & Article Extraction (Article, NewsArticle)

Extract structured article data from news sites, blogs, and publications.

Schema.org Article types provide structured representations of news content, blog posts, and published articles with standardized properties for headlines, authors, publication dates, and content.
  • Automatic detection of Article, NewsArticle, and BlogPosting types
  • Extract headline, author, date, content, and category information
  • Identify related content and publications for a complete content graph
// Article schema example (simplified)
{
  "id": "12345678-1234-1234-1234-123456789012",
  "url": "https://example.com/article",
  "status": "completed",
  "schemaType": "NewsArticle",
  "data": {
    "@type": "NewsArticle",
    "headline": "Breaking News: Important Event",
    "author": "John Doe",
    "datePublished": "2025-03-29T10:56:25+00:00",
    "articleBody": "This is the main content of the article..."
  }
}
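The examples on this page use two field layouts: the GET /api/scrape/:id response carries `detected_schema_type` and `enriched_schema`, while the simplified use-case examples carry `schemaType` and `data`. Which layout a given endpoint returns is not stated here, so the sketch below accepts either. The helper names `readSchema` and `isArticle` are hypothetical.

```javascript
// Hypothetical helper: read the schema type and payload from either
// response layout shown on this page.
function readSchema(result) {
  const type = result.detected_schema_type ?? result.schemaType ?? null;
  const data = result.enriched_schema ?? result.data ?? null;
  return { type, data };
}

// Keep only article-like results. The three type names come from the
// bullet list above.
const ARTICLE_TYPES = new Set(['Article', 'NewsArticle', 'BlogPosting']);

function isArticle(result) {
  return ARTICLE_TYPES.has(readSchema(result).type);
}
```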

E-Commerce Product Data (Product)

Extract structured product information from e-commerce sites.

The Product schema represents items for sale with detailed specifications. It captures pricing, availability, reviews, and product details in a standardized format for e-commerce applications.
  • Complete Product schema extraction with pricing, availability, and ratings
  • Extract product images, descriptions, and specifications
  • Monitor price changes and inventory status over time
// Product schema example (simplified)
{
  "schemaType": "Product",
  "data": {
    "@type": "Product",
    "name": "Premium Wireless Headphones",
    "description": "High-quality wireless headphones with noise cancellation",
    "image": "https://example.com/images/headphones.jpg",
    "brand": {
      "@type": "Brand",
      "name": "AudioTech"
    },
    "offers": {
      "@type": "Offer",
      "price": 129.99,
      "priceCurrency": "USD",
      "availability": "https://schema.org/InStock"
    }
  }
}
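Monitoring price and inventory over time comes down to comparing two extractions of the same page. The sketch below assumes the simplified layout shown above, with `offers` as a single object (schema.org also allows a list of offers, which this sketch does not handle). The helper name `diffOffer` is hypothetical.

```javascript
// Hypothetical helper: compare two Product snapshots of the same page and
// report which offer fields changed. Assumes `offers` is a single object.
function diffOffer(previous, current) {
  const before = previous.data.offers;
  const after = current.data.offers;
  return ['price', 'priceCurrency', 'availability']
    .filter((field) => before[field] !== after[field])
    .map((field) => ({ field, from: before[field], to: after[field] }));
}
```

An empty array means nothing changed between the two snapshots.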

Recipe Collection (Recipe)

Extract detailed recipe information from food blogs and recipe sites.

Recipe schema captures cooking instructions, ingredients, nutrition information, and preparation details. It's perfect for food blogs, recipe collections, and culinary applications.
  • Extract ingredients, instructions, cooking times, and nutritional data
  • Build searchable recipe databases with structured attributes
  • Analyze recipe trends and ingredient combinations across sources
// Recipe schema example (simplified)
{
  "schemaType": "Recipe",
  "data": {
    "@type": "Recipe",
    "name": "Chocolate Chip Cookies",
    "recipeCategory": ["Dessert", "Baking"],
    "recipeIngredient": [
      "2 cups all-purpose flour",
      "1 cup butter",
      "1 cup chocolate chips"
    ],
    "recipeInstructions": [
      "Preheat oven to 350°F",
      "Mix ingredients",
      "Bake for 12 minutes"
    ],
    "cookTime": "PT12M"
  }
}
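The `cookTime` value is an ISO 8601 duration, which needs converting before recipes can be sorted or filtered by time. The sketch below handles the common time-only forms. The helper name `durationToMinutes` is hypothetical, and durations with days or larger units are deliberately left unsupported.

```javascript
// Hypothetical helper: convert an ISO 8601 duration such as "PT12M" or
// "PT1H30M" to minutes. Only hours, minutes, and seconds are handled;
// anything else (for example "P1D") returns null.
function durationToMinutes(duration) {
  const match = /^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$/.exec(duration);
  if (!match || duration === 'PT') return null;
  const [, hours = 0, minutes = 0, seconds = 0] = match;
  // Seconds are rounded to the nearest whole minute.
  return Number(hours) * 60 + Number(minutes) + Math.round(Number(seconds) / 60);
}
```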

Event Data Collection (Event)

Extract event information from venue sites and ticketing platforms.

Event schema structures information about happenings with defined times and locations. It includes details about venues, performers, organizers, and ticketing essential for event aggregation.
  • Extract event name, dates, venue, performers, and ticket information
  • Build event aggregation platforms with structured data
  • Monitor for new events and schedule changes automatically
// Event schema example (simplified)
{
  "schemaType": "Event",
  "data": {
    "@type": "Event",
    "name": "Annual Tech Conference",
    "startDate": "2025-06-15T09:00:00-07:00",
    "endDate": "2025-06-17T17:00:00-07:00",
    "location": {
      "@type": "Place",
      "name": "Convention Center",
      "address": {
        "@type": "PostalAddress",
        "addressLocality": "San Francisco"
      }
    },
    "performer": {
      "@type": "Person",
      "name": "Jane Smith"
    }
  }
}
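Detecting a schedule change means comparing the dates from two extractions of the same event page. The sketch below assumes the simplified layout shown above and compares dates as instants rather than as strings. The helper name `scheduleChanged` is hypothetical.

```javascript
// Hypothetical helper: detect a schedule change between two snapshots of
// the same Event. Dates are compared as instants, so the same moment
// written with a different UTC offset does not count as a change.
function scheduleChanged(previous, current) {
  return ['startDate', 'endDate'].some((field) => {
    const before = Date.parse(previous.data[field]);
    const after = Date.parse(current.data[field]);
    // If only one side has a parseable date, flag it for review.
    if (Number.isNaN(before) || Number.isNaN(after)) {
      return Number.isNaN(before) !== Number.isNaN(after);
    }
    return before !== after;
  });
}
```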

Organization & Person Data (Organization, Person)

Extract structured data about organizations and people.

Organization and Person schemas capture details about entities. Organizations include companies and institutions, while Person schema covers individuals with their properties, relationships, and identifiers.
  • Build comprehensive company and personnel databases
  • Extract contact information, social profiles, and affiliations
  • Enrich CRM data with structured information from the web
// Organization schema example (simplified)
{
  "schemaType": "Organization",
  "data": {
    "@type": "Organization",
    "name": "Acme Corporation",
    "description": "Leading provider of innovative solutions",
    "url": "https://example.com",
    "logo": "https://example.com/logo.png",
    "address": {
      "@type": "PostalAddress",
      "streetAddress": "123 Main St",
      "addressLocality": "San Francisco"
    },
    "telephone": "+1-555-123-4567",
    "sameAs": [
      "https://twitter.com/acmecorp",
      "https://linkedin.com/company/acmecorp"
    ]
  }
}
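For CRM enrichment, the `sameAs` links are easier to use when keyed by site. The sketch below indexes them by hostname, assuming the simplified layout shown above. The helper name `socialProfiles` is hypothetical; if an organization lists two links on the same host, the later one wins.

```javascript
// Hypothetical helper: index an Organization's sameAs links by hostname,
// e.g. { "twitter.com": "https://twitter.com/acmecorp" }.
function socialProfiles(result) {
  const links = result.data.sameAs ?? [];
  const profiles = {};
  for (const link of Array.isArray(links) ? links : [links]) {
    try {
      const host = new URL(link).hostname.replace(/^www\./, '');
      profiles[host] = link;
    } catch {
      // Skip values that are not absolute URLs.
    }
  }
  return profiles;
}
```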

Web Page Metadata (WebPage)

Extract comprehensive metadata from web pages for organization and indexing.

WebPage schema provides a structured representation of web content with metadata about the page itself, including authors, modification dates, breadcrumbs, and related content for improved organization.
  • Extract page titles, descriptions, and metadata
  • Capture page structure including breadcrumbs and navigation
  • Identify site sections, categories, and related content
// WebPage schema example (simplified)
{
  "id": "7874a552-8c19-4bbb-8d3b-132889c1a8f0",
  "url": "https://example.com/page",
  "status": "completed",
  "schemaType": "WebPage",
  "data": {
    "@type": "WebPage",
    "name": "Page Title - Example Site",
    "headline": "Main Headline of the Page",
    "description": "This is the page description...",
    "datePublished": "2025-03-29T10:56:25+00:00",
    "dateModified": "2025-03-29T10:56:25+00:00",
    "breadcrumb": {
      "@id": "https://example.com/page#breadcrumb"
    },
    "inLanguage": "en-US"
  }
}

Join The Revolution

The Future of Media is Community

Join a community of media professionals building the next generation of tools. Get free access to premium features and shape the future of the industry.

Community-driven platform

Connect with industry professionals

Free media monitoring

Track mentions and coverage in real-time

Digital newsroom tools

Manage your media presence efficiently

AI-powered collaboration

Work smarter with intelligent tools


Be part of the next media revolution

Frequently asked questions

If you can't find what you're looking for, email our support team.

    • What exactly does the Sentient Scraper do?

      The Sentient Scraper is an AI-powered tool that extracts structured data from any web page. It automatically identifies the content type (e.g., article, product listing, recipe) and converts it into a clean, well-organized JSON format that can be easily integrated into your applications, databases, or content management systems.

    • What types of content can it extract?

      Our scraper can extract virtually any type of content including news articles, blog posts, product listings, recipes, event listings, job postings, real estate listings, and more. The AI automatically identifies the schema type and extracts the relevant fields for that content type.

    • How accurate is the extraction?

      The Sentient Scraper uses advanced AI to achieve high accuracy in content extraction. It provides confidence scores for extracted elements and continuously improves through machine learning. For most common content types on well-structured pages, the accuracy rate exceeds 95%.

    • How many pages can I scrape with my subscription?

      Each subscription plan comes with a different allocation of monthly scrapes. The free trial allows 3 scrapes. Our Basic plan includes 1,000 scrapes per month, Professional includes 10,000, and Enterprise plans offer custom volumes. Please visit our pricing page for the latest details on all subscription options.

    • Do I need technical skills to use the Sentient Scraper?

      No technical skills are required to use the basic features of the Sentient Scraper. Our user-friendly interface allows you to simply enter a URL and get structured data in return. For advanced integrations, we provide comprehensive API documentation and developer resources.

    • Is there an API available for the Sentient Scraper?

      Yes, we offer a RESTful API that allows you to integrate the Sentient Scraper directly into your applications and workflows. The API supports all the same features as the web interface, with additional options for customization and batch processing.