Entity Parsing

Extraction and parsing of entities from tweet text including hashtags, URLs, user mentions, media attachments, and stock symbols.

Capabilities

Entities Container

Container class that holds all parsed entities from tweet text, providing structured access to hashtags, URLs, user mentions, media, and symbols.

/**
 * Container for all entities parsed from tweet text.
 * Provides structured access to hashtags, URLs, mentions, media, and symbols.
 */
public class Entities {
    
    /**
     * Default constructor that initializes all entity lists.
     */
    public Entities();
    
    /**
     * Get the list of hashtags found in the tweet text.
     * @return List of HashTags objects
     */
    public List<HashTags> getHashtags();
    
    /**
     * Set the list of hashtags found in the tweet text.
     * @param hashtags List of HashTags objects
     */
    public void setHashtags(List<HashTags> hashtags);
    
    /**
     * Get the list of URLs found in the tweet text.
     * @return List of URL objects
     */
    public List<URL> getUrls();
    
    /**
     * Set the list of URLs found in the tweet text.
     * @param urls List of URL objects
     */
    public void setUrls(List<URL> urls);
    
    /**
     * Get the list of user mentions found in the tweet text.
     * @return List of UserMention objects
     */
    public List<UserMention> getUser_mentions();
    
    /**
     * Set the list of user mentions found in the tweet text.
     * @param user_mentions List of UserMention objects
     */
    public void setUser_mentions(List<UserMention> user_mentions);
    
    /**
     * Get the list of media attachments in the tweet.
     * @return List of Media objects
     */
    public List<Media> getMedia();
    
    /**
     * Set the list of media attachments in the tweet.
     * @param media List of Media objects
     */
    public void setMedia(List<Media> media);
    
    /**
     * Get the list of stock symbols found in the tweet text.
     * @return List of Symbol objects
     */
    public List<Symbol> getSymbols();
    
    /**
     * Set the list of stock symbols found in the tweet text.
     * @param symbols List of Symbol objects
     */
    public void setSymbols(List<Symbol> symbols);
}

HashTags

Hashtag entities parsed from tweet text with position information and cleaned text content.

/**
 * Represents hashtags parsed from tweet text.
 * Contains the hashtag text and its position indices in the original text.
 */
public class HashTags {
    
    /**
     * Get the position indices of this hashtag in the tweet text.
     * @return Array of [start, end] positions
     */
    public long[] getIndices();
    
    /**
     * Set the position indices of this hashtag in the tweet text.
     * @param indices Array of [start, end] positions
     */
    public void setIndices(long[] indices);
    
    /**
     * Set the position indices of this hashtag in the tweet text.
     * @param start Starting position in tweet text
     * @param end Ending position in tweet text
     */
    public void setIndices(long start, long end);
    
    /**
     * Get the hashtag text without the # symbol.
     * @return Hashtag text
     */
    public String getText();
    
    /**
     * Set the hashtag text, optionally processing to remove # symbol.
     * @param text Hashtag text
     * @param hashExist Whether the text includes the # symbol
     */
    public void setText(String text, boolean hashExist);
}
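
The hashExist flag suggests that setText strips a leading # when the caller passes raw tweet text. The sketch below shows one way such normalization might work. HashtagTextSketch and normalize are hypothetical names introduced for illustration, and the library's actual implementation may differ.

```java
// Hypothetical helper, not part of the library: illustrates the kind of
// normalization that setText(text, hashExist) is documented to perform.
final class HashtagTextSketch {

    // Returns the hashtag text without a leading '#'.
    // hashExist mirrors the documented flag: true when the input may still
    // carry the '#' symbol and it should be removed.
    static String normalize(String text, boolean hashExist) {
        if (hashExist && text.startsWith("#")) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(normalize("#flink", true));
        System.out.println(normalize("flink", false));
    }
}
```

With this reading of the flag, passing false leaves the text untouched even if it happens to start with #.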

URL Entities

URL entities found in tweet text with expanded and display versions.

/**
 * Represents URLs found in tweet text.
 * Contains original, expanded, and display versions of URLs.
 */
public class URL {
    
    /**
     * Get the position indices of this URL in the tweet text.
     * @return Array of [start, end] positions
     */
    public long[] getIndices();
    
    /**
     * Set the position indices of this URL in the tweet text.
     * @param indices Array of [start, end] positions
     */
    public void setIndices(long[] indices);
    
    /**
     * Get the original URL as it appears in the tweet.
     * @return Original URL (usually shortened)
     */
    public String getUrl();
    
    /**
     * Set the original URL as it appears in the tweet.
     * @param url Original URL (usually shortened)
     */
    public void setUrl(String url);
    
    /**
     * Get the expanded/resolved URL.
     * @return Full expanded URL
     */
    public String getExpanded_url();
    
    /**
     * Set the expanded/resolved URL.
     * @param expanded_url Full expanded URL
     */
    public void setExpanded_url(String expanded_url);
    
    /**
     * Get the display URL shown to users.
     * @return Display-friendly URL
     */
    public String getDisplay_url();
    
    /**
     * Set the display URL shown to users.
     * @param display_url Display-friendly URL
     */
    public void setDisplay_url(String display_url);
}

UserMention

User mention entities (@username) found in tweet text.

/**
 * Represents user mentions (@username) found in tweet text.
 * Contains user information and position data.
 */
public class UserMention {
    
    /**
     * Get the position indices of this mention in the tweet text.
     * @return Array of [start, end] positions
     */
    public long[] getIndices();
    
    /**
     * Set the position indices of this mention in the tweet text.
     * @param indices Array of [start, end] positions
     */
    public void setIndices(long[] indices);
    
    /**
     * Get the mentioned user's ID.
     * @return User ID
     */
    public long getId();
    
    /**
     * Set the mentioned user's ID.
     * @param id User ID
     */
    public void setId(long id);
    
    /**
     * Get the mentioned user's ID as string.
     * @return User ID as string
     */
    public String getId_str();
    
    /**
     * Set the mentioned user's ID as string (computed from numeric ID).
     */
    public void setId_str();
    
    /**
     * Get the mentioned user's screen name.
     * @return Screen name without @ symbol
     */
    public String getScreen_name();
    
    /**
     * Set the mentioned user's screen name.
     * @param screen_name Screen name without @ symbol
     */
    public void setScreen_name(String screen_name);
    
    /**
     * Get the mentioned user's display name.
     * @return Display name
     */
    public String getName();
    
    /**
     * Set the mentioned user's display name.
     * @param name Display name
     */
    public void setName(String name);
}
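
The no-argument setId_str() is documented as deriving the string form from the numeric ID. A minimal sketch of that derivation follows, along with a check that shows why a string form is useful: 64-bit IDs above 2^53 - 1 cannot be represented exactly as double-precision numbers, which some JSON consumers use for all numeric values. IdStringSketch and its methods are hypothetical names, not library code.

```java
// Hypothetical helper, not part of the library.
final class IdStringSketch {

    // Largest integer magnitude a double can represent without gaps (2^53 - 1).
    static final long MAX_SAFE_INTEGER = (1L << 53) - 1;

    // Derives the string form of an ID from its numeric form.
    static String toIdStr(long id) {
        return Long.toString(id);
    }

    // True when the ID can pass through a double-based parser unchanged.
    static boolean isSafeForDouble(long id) {
        return id >= -MAX_SAFE_INTEGER && id <= MAX_SAFE_INTEGER;
    }

    public static void main(String[] args) {
        long id = 9007199254740993L; // 2^53 + 1
        System.out.println(toIdStr(id));
        System.out.println(isSafeForDouble(id));
    }
}
```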

Media

Media attachment entities including images and videos.

/**
 * Represents media attachments (images, videos) in tweets.
 * Contains URLs, dimensions, and metadata for media content.
 */
public class Media {
    
    /**
     * Get the position indices of this media in the tweet text.
     * @return Array of [start, end] positions
     */
    public long[] getIndices();
    
    /**
     * Set the position indices of this media in the tweet text.
     * @param indices Array of [start, end] positions
     */
    public void setIndices(long[] indices);
    
    /**
     * Get the media ID.
     * @return Media ID
     */
    public long getId();
    
    /**
     * Set the media ID.
     * @param id Media ID
     */
    public void setId(long id);
    
    /**
     * Get the media ID as string.
     * @return Media ID as string
     */
    public String getId_str();
    
    /**
     * Set the media ID as string.
     * @param id_str Media ID as string
     */
    public void setId_str(String id_str);
    
    /**
     * Get the media URL.
     * @return Media URL
     */
    public String getMedia_url();
    
    /**
     * Set the media URL.
     * @param media_url Media URL
     */
    public void setMedia_url(String media_url);
    
    /**
     * Get the HTTPS media URL.
     * @return HTTPS media URL
     */
    public String getMedia_url_https();
    
    /**
     * Set the HTTPS media URL.
     * @param media_url_https HTTPS media URL
     */
    public void setMedia_url_https(String media_url_https);
    
    /**
     * Get the display URL for this media.
     * @return Display URL
     */
    public String getDisplay_url();
    
    /**
     * Set the display URL for this media.
     * @param display_url Display URL
     */
    public void setDisplay_url(String display_url);
    
    /**
     * Get the expanded URL for this media.
     * @return Expanded URL
     */
    public String getExpanded_url();
    
    /**
     * Set the expanded URL for this media.
     * @param expanded_url Expanded URL
     */
    public void setExpanded_url(String expanded_url);
    
    /**
     * Get the original URL that was extracted from the tweet.
     * @return Original URL
     */
    public String getUrl();
    
    /**
     * Set the original URL that was extracted from the tweet.
     * @param url Original URL
     */
    public void setUrl(String url);
    
    /**
     * Get the media type (photo, video, etc.).
     * @return Media type
     */
    public String getType();
    
    /**
     * Set the media type (photo, video, etc.).
     * @param type Media type
     */
    public void setType(String type);
    
    /**
     * Get the available sizes for this media.
     * @return Map of size names to Size objects
     */
    public Map<String, Size> getSizes();
    
    /**
     * Set the available sizes for this media.
     * @param sizes Map of size names to Size objects
     */
    public void setSizes(Map<String, Size> sizes);
}

Symbol

Stock symbol entities ($SYMBOL) found in tweet text.

/**
 * Represents stock symbols ($SYMBOL) found in tweet text.
 * Contains symbol text and position information.
 */
public class Symbol {
    
    /**
     * Get the position indices of this symbol in the tweet text.
     * @return Array of [start, end] positions
     */
    public long[] getIndices();
    
    /**
     * Set the position indices of this symbol in the tweet text.
     * @param indices Array of [start, end] positions
     */
    public void setIndices(long[] indices);
    
    /**
     * Get the stock symbol text without the $ symbol.
     * @return Stock symbol text
     */
    public String getText();
    
    /**
     * Set the stock symbol text.
     * @param text Stock symbol text
     */
    public void setText(String text);
}

Size

Media size information for different image/video dimensions.

/**
 * Represents size information for media attachments.
 * Contains dimensions and resize information for different media sizes.
 */
public class Size {
    
    /**
     * Constructor with size dimensions and resize method.
     * @param width Width in pixels
     * @param height Height in pixels
     * @param resize Resize method
     */
    public Size(long width, long height, String resize);
    
    /**
     * Get the width of this media size.
     * @return Width in pixels
     */
    public long getWidth();
    
    /**
     * Set the width of this media size.
     * @param width Width in pixels
     */
    public void setWidth(long width);
    
    /**
     * Get the height of this media size.
     * @return Height in pixels
     */
    public long getHeight();
    
    /**
     * Set the height of this media size.
     * @param height Height in pixels
     */
    public void setHeight(long height);
    
    /**
     * Get the resize method for this media size.
     * @return Resize method (fit, crop, etc.)
     */
    public String getResize();
    
    /**
     * Set the resize method for this media size.
     * @param resize Resize method (fit, crop, etc.)
     */
    public void setResize(String resize);
}
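
Because getSizes() returns a map keyed by size name, a caller that does not want to hard-code a key such as large can pick the variant with the greatest pixel area. The sketch below relies only on the getters documented above. Its nested Size class is a minimal stand-in so the example compiles by itself, and LargestSizeSketch and largestSizeName are hypothetical names.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of the library.
final class LargestSizeSketch {

    // Minimal stand-in for the documented Size class (constructor and getters only).
    static final class Size {
        private final long width;
        private final long height;
        private final String resize;

        Size(long width, long height, String resize) {
            this.width = width;
            this.height = height;
            this.resize = resize;
        }

        long getWidth() { return width; }
        long getHeight() { return height; }
        String getResize() { return resize; }
    }

    // Returns the name of the size variant with the largest pixel area,
    // or an empty Optional when the map is null or has no usable entries.
    static Optional<String> largestSizeName(Map<String, Size> sizes) {
        if (sizes == null) {
            return Optional.empty();
        }
        return sizes.entrySet().stream()
                .filter(e -> e.getValue() != null)
                .max(Comparator.comparingLong(
                        (Map.Entry<String, Size> e) ->
                                e.getValue().getWidth() * e.getValue().getHeight()))
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        Map<String, Size> sizes = new LinkedHashMap<>();
        sizes.put("thumb", new Size(150, 150, "crop"));
        sizes.put("large", new Size(1024, 768, "fit"));
        System.out.println(largestSizeName(sizes).orElse("none"));
    }
}
```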

Usage Examples:

import org.apache.flink.contrib.tweetinputformat.model.tweet.Tweet;
import org.apache.flink.contrib.tweetinputformat.model.tweet.entities.*;
import java.util.List;
import java.util.Map;

// Process all entities in a tweet
Tweet tweet = ...; // get tweet
Entities entities = tweet.getEntities();

// Extract hashtags
List<HashTags> hashtags = entities.getHashtags();
for (HashTags tag : hashtags) {
    System.out.println("Hashtag: #" + tag.getText());
    long[] indices = tag.getIndices();
    System.out.println("Position: " + indices[0] + "-" + indices[1]);
}

// Extract URLs
List<URL> urls = entities.getUrls();
for (URL url : urls) {
    System.out.println("URL: " + url.getUrl());
    System.out.println("Expanded: " + url.getExpanded_url());
    System.out.println("Display: " + url.getDisplay_url());
}

// Extract user mentions
List<UserMention> mentions = entities.getUser_mentions();
for (UserMention mention : mentions) {
    System.out.println("Mentioned: @" + mention.getScreen_name());
    System.out.println("Name: " + mention.getName());
}

// Extract media
List<Media> mediaList = entities.getMedia();
for (Media media : mediaList) {
    System.out.println("Media type: " + media.getType());
    System.out.println("Media URL: " + media.getMedia_url_https());
    
    // Check available sizes
    Map<String, Size> sizes = media.getSizes();
    if (sizes.containsKey("large")) {
        Size largeSize = sizes.get("large");
        System.out.printf("Large size: %dx%d%n", largeSize.getWidth(), largeSize.getHeight());
    }
}

// Extract stock symbols
List<Symbol> symbols = entities.getSymbols();
for (Symbol symbol : symbols) {
    System.out.println("Stock symbol: $" + symbol.getText());
}

Entity Analysis Patterns

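Common patterns for analyzing entities in stream processing. The snippets are schematic: tweets stands for a stream or collection of Tweet objects, operators such as countByValue() depend on the processing framework in use, and extractDomain and UserInteraction are placeholders that this section does not define.

import java.util.stream.Collectors;

// Popular hashtags analysis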
tweets.flatMap(tweet -> {
    return tweet.getEntities().getHashtags().stream()
        .map(HashTags::getText)
        .collect(Collectors.toList());
}).countByValue();

// URL domain analysis
tweets.flatMap(tweet -> {
    return tweet.getEntities().getUrls().stream()
        .map(url -> extractDomain(url.getExpanded_url()))
        .collect(Collectors.toList());
});

// User mention network analysis
tweets.flatMap(tweet -> {
    String author = tweet.getUser().getScreen_name();
    return tweet.getEntities().getUser_mentions().stream()
        .map(mention -> new UserInteraction(author, mention.getScreen_name()))
        .collect(Collectors.toList());
});

// Media type distribution
tweets.filter(tweet -> !tweet.getEntities().getMedia().isEmpty())
      .map(tweet -> tweet.getEntities().getMedia().get(0).getType())
      .countByValue();

// Stock symbol tracking
tweets.filter(tweet -> !tweet.getEntities().getSymbols().isEmpty())
      .flatMap(tweet -> tweet.getEntities().getSymbols().stream()
          .map(Symbol::getText)
          .collect(Collectors.toList()));
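
The patterns above call an extractDomain helper and a countByValue step without defining either. The sketch below gives one plain-Java reading of both, using only the standard library. EntityAnalysisSketch is a hypothetical name, and the choices made here (lower-casing the host, dropping a leading www., returning null for unparseable input) are assumptions rather than behavior the source specifies.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helpers, not part of the library.
final class EntityAnalysisSketch {

    // Returns the lower-cased host of a URL without a leading "www.",
    // or null when the input is null, malformed, or has no host part.
    static String extractDomain(String url) {
        if (url == null) {
            return null;
        }
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null;
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Counts how often each value occurs in an in-memory list.
    static Map<String, Long> countByValue(List<String> values) {
        return values.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(extractDomain("https://www.Example.com/a?b=c"));
        System.out.println(countByValue(Arrays.asList("flink", "java", "flink")));
    }
}
```

In a distributed job the counting step would normally be expressed with the framework's own keyed aggregation. The in-memory version only illustrates the intended result.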

Position-Based Text Extraction

Using entity indices to extract text segments:

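public String extractEntityText(String tweetText, long[] indices) {
    // Note: substring counts UTF-16 code units. Twitter reports entity indices
    // as Unicode code point offsets, so if the indices are carried over unchanged,
    // text with emoji or other supplementary characters before the entity needs
    // the offsets converted first (for example with String.offsetByCodePoints).
    int start = (int) indices[0];
    int end = (int) indices[1];
    return tweetText.substring(start, end);
}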

// Example usage
String tweetText = tweet.getText();
for (HashTags hashtag : tweet.getEntities().getHashtags()) {
    String hashtagWithSymbol = extractEntityText(tweetText, hashtag.getIndices());
    // hashtagWithSymbol will include the # symbol
    System.out.println("Full hashtag: " + hashtagWithSymbol);
    System.out.println("Clean text: " + hashtag.getText());
}
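
Index-based extraction depends on what the indices count. Twitter documents entity indices as Unicode code point offsets, whereas Java strings are indexed in UTF-16 code units, so the two diverge as soon as an emoji or other supplementary character precedes the entity. The sketch below assumes the indices are passed through from Twitter's JSON unchanged. CodePointSliceSketch and sliceByCodePoints are hypothetical names.

```java
// Hypothetical helper, not part of the library.
final class CodePointSliceSketch {

    // Slices text using [start, end) offsets expressed in Unicode code points.
    static String sliceByCodePoints(String text, long[] indices) {
        // Translate the code point offsets into UTF-16 code unit offsets.
        int start = text.offsetByCodePoints(0, (int) indices[0]);
        int end = text.offsetByCodePoints(start, (int) (indices[1] - indices[0]));
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // The rocket emoji is one code point but two UTF-16 code units.
        String text = "\uD83D\uDE80 launch #flink";
        long[] indices = {9, 15};

        System.out.println(sliceByCodePoints(text, indices)); // prints "#flink"
        System.out.println(text.substring(9, 15));            // prints " #flin"
    }
}
```

For text made up entirely of characters in the Basic Multilingual Plane, both approaches return the same substring.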