Abstract
Researchers now have access to valuable data from social networking applications, of which Twitter and Facebook are the most prominent examples. To retrieve this content, the Twitter service provides two distinct Application Programming Interfaces (APIs), a probe-based one and a streaming one, each of which imposes different limitations on the data collection process. In this paper, we present a general architecture that facilitates faceted crawling of the service and thereby simplifies retrieval. We give implementation details of our system and provide a simple way to express the crawling process, i.e., the crawl flow. We experimentally evaluate the system on a variety of faceted crawls, demonstrating its efficacy for the online medium.
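To make the notion of a crawl flow more concrete, the following is a minimal illustrative sketch and not the system described in the paper. It assumes a crawl flow can be modeled as a tweet source followed by a chain of facets (predicates). The names `CrawlFlow`, `where`, `run`, and `sample_stream` are hypothetical, and the source is an in-memory stand-in rather than a call to either Twitter API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical tweet record and facet type; neither is taken from the paper.
Tweet = Dict[str, str]
Facet = Callable[[Tweet], bool]


@dataclass
class CrawlFlow:
    """Assumed model of a crawl flow: a tweet source plus an ordered list of facets."""
    source: Callable[[], Iterable[Tweet]]
    facets: List[Facet] = field(default_factory=list)

    def where(self, facet: Facet) -> "CrawlFlow":
        # Return a new flow so an existing flow can be reused and extended.
        return CrawlFlow(self.source, self.facets + [facet])

    def run(self) -> Iterator[Tweet]:
        # Yield only the tweets that satisfy every facet in the flow.
        for tweet in self.source():
            if all(facet(tweet) for facet in self.facets):
                yield tweet


def sample_stream() -> List[Tweet]:
    # In-memory stand-in for a streaming or probe-based API; no network access.
    return [
        {"user": "a", "text": "traffic jam in Athens", "lang": "en"},
        {"user": "b", "text": "good morning", "lang": "el"},
        {"user": "c", "text": "sunny day", "lang": "en"},
    ]


flow = (
    CrawlFlow(sample_stream)
    .where(lambda t: t["lang"] == "en")        # language facet
    .where(lambda t: "Athens" in t["text"])    # keyword facet
)
results = list(flow.run())
print([t["user"] for t in results])
```

In this sketch each facet narrows the crawl independently, so the same source can feed several flows with different facet combinations. A real crawler would additionally have to respect the rate limits of whichever API backs the source.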
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Valkanas, G., Saravanou, A., Gunopulos, D. (2014). A Faceted Crawler for the Twitter Service. In: Benatallah, B., Bestavros, A., Manolopoulos, Y., Vakali, A., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2014. WISE 2014. Lecture Notes in Computer Science, vol 8787. Springer, Cham. https://doi.org/10.1007/978-3-319-11746-1_13
DOI: https://doi.org/10.1007/978-3-319-11746-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11745-4
Online ISBN: 978-3-319-11746-1
eBook Packages: Computer Science, Computer Science (R0)