Files:
SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks

Raúl Gracia-Tinedo, Danny Harnik, Dalit Naor, Dmitry Sotnikov, Sivan Toledo and Aviad Zuck

13th USENIX Conference on File and Storage Technologies (FAST '15)

Storage system benchmarks either use samples of proprietary data or synthesize artificial data in simple ways (such as using zeros or random data). However, many storage systems behave completely differently on such artificial data than they do on real-world data. This is the case with systems that include data reduction techniques, such as compression and/or deduplication.

To address this problem, we propose a benchmarking methodology called mimicking and apply it in the domain of data compression. Our methodology is based on characterizing the properties of real data that influence the performance of compressors. Then, we use these characterizations to generate new synthetic data that mimics the real one in many aspects of compression. Unlike current solutions that only address the compression ratio of data, mimicking is flexible enough to also emulate compression times and data heterogeneity. We show that these properties matter to the system’s performance.

In our implementation, called SDGen, characterizations take at most 2.5KB per data chunk (e.g., 64KB) and can be used to efficiently share benchmarking data in a highly anonymized fashion; sharing it carries few or no privacy concerns. We evaluated our data generator’s accuracy on compressibility and compression times using real-world datasets and multiple compressors (lz4, zlib, bzip2 and lzma). As a proof-of-concept, we integrated SDGen as a content generation layer in two popular benchmarks (LinkBench and Impressions).
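The gap between artificial and realistic content that motivates this work can be illustrated with a minimal sketch. It is not SDGen itself: it only compares zlib compression ratios for a 64KB chunk of zeros, a chunk of pseudo-random bytes, and a text-like chunk. The vocabulary, the function names and the seed are hypothetical choices made for illustration.

```python
import random
import zlib

CHUNK_SIZE = 64 * 1024  # 64KB, the example chunk size mentioned in the abstract


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data)) / len(data)


def make_chunks(seed: int = 0) -> dict:
    """Build three chunks: zeros, pseudo-random bytes, and text-like content."""
    rng = random.Random(seed)
    zeros = bytes(CHUNK_SIZE)
    noise = bytes(rng.getrandbits(8) for _ in range(CHUNK_SIZE))

    # Hypothetical stand-in for real data: words drawn from a small vocabulary.
    vocab = [b"storage", b"benchmark", b"object", b"cloud",
             b"compression", b"chunk", b"dataset", b"system"]
    words, size = [], 0
    while size <= CHUNK_SIZE:
        word = rng.choice(vocab)
        words.append(word)
        size += len(word) + 1  # +1 for the separating space
    text = b" ".join(words)[:CHUNK_SIZE]

    return {"zeros": zeros, "text": text, "random": noise}


ratios = {name: compression_ratio(chunk) for name, chunk in make_chunks().items()}
for name, ratio in ratios.items():
    print(f"{name:>6}: ratio = {ratio:.4f}")
```

Zeros collapse to almost nothing and random bytes do not compress at all, while the text-like chunk falls in between. A benchmark that feeds only the two extremes to a compressing storage system never exercises that middle ground.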



Created: 2015-02-25
Size: 872.21 KB
Downloads: 2,341.00
The Power of Swarming in Personal Clouds Under Bandwidth Budget

Rahma Chaabouni, Marc Sánchez-Artigas and Pedro García-López

Elsevier Journal of Network and Computer Applications

Users increasingly rely on personal clouds (such as Dropbox and Box) to store, edit and retrieve files kept on remote servers. These systems generally follow a client–server model to distribute the files to end-users. This means that they require a huge amount of bandwidth to meet the requirements of their clients. Personal clouds with a limited bandwidth budget can benefit from the upload speed of the clients interested in the same content to improve the quality of service. This can be done by introducing a peer-to-peer protocol, BitTorrent for instance, when the load on a certain piece of content becomes high. The main challenge is to decide when to switch to BitTorrent and how to allocate the cloud's available bandwidth to the different clients. In this paper, we propose an algorithm for the allocation of the cloud's bandwidth. Based on the current load and the predefined quality of service constraints, the algorithm identifies the most suitable protocol for each swarm and provides the corresponding bandwidth allocation. We validate the algorithm using a real trace of the Ubuntu One system and the results show important gains in the download times experienced by the clients.



Created: 2016-04-14
Size: 3.8 MB
Downloads: 1,873.00
Vertigo: Programmable Micro-controllers for Software-Defined Object Storage

J. Sampé, M. Sánchez-Artigas and P. García-López

IEEE International Conference on Cloud Computing (IEEE CLOUD'16)

Software-defined storage (SDS) aims to minimize the complexity of data management in the Cloud. SDS decouples the control plane from the data plane and simplifies the management of the storage system via automated storage policy enforcement. In this paper, we propose a novel SDS framework for Object Storage that decentralizes policy enforcement by deploying per-object management policies in the storage nodes. As in active storage systems, we leverage the underutilized CPU time in the storage nodes. But our framework goes one step further. It provides a new management abstraction called micro-controllers, which operate on objects depending on their state and content, thereby permitting the implementation of sophisticated management policies, such as the automated deletion of an object based on its access history, and even allowing the orchestration of active storage tasks.

Our SDS system avoids the massive interception of data flows by moving that logic to the appropriate objects. Furthermore, our extensible model simplifies the customization of Object Storage services. We present in the validation several interesting use cases such as automated deletion, content level access control, and Web prefetching. 



Created: 2016-06-13
Size: 255.61 KB
Downloads: 1,722.00
Understanding Data Sharing in Private Personal Clouds

R. Gracia-Tinedo, P. García-López, A. Gómez and A. Illana

IEEE International Conference on Cloud Computing (IEEE CLOUD'16)

Data sharing in Personal Clouds blurs the lines between on-line storage and content distribution with a strong social component. Such social information may be exploited by researchers to devise optimized data management techniques for Personal Clouds. Unfortunately, due to their proprietary nature, data sharing is one of the least studied facets of these systems.

In this work, we present the first study of data sharing in a private Personal Cloud. Concretely, we contribute a dataset collected at the metadata back-end of NEC: an enterprise oriented Personal Cloud. First, our analysis provides a deep inspection of the storage layer of NEC, comparing it with a well-known public vendor (UbuntuOne). Second, we study the social structure of NEC user communities, as well as the storage characteristics of user sharing links via multiplex network techniques.

Finally, we discuss a battery of data management optimizations for NEC derived from our findings, which may be of independent interest for other similar systems. Our proposals include content distribution, caching and data placement. We believe that both our study and dataset will foster further research in this field.



Created: 2016-06-13
Size: 1.26 MB
Downloads: 1,957.00
IOStack: Software-Defined Object Storage

Raúl Gracia-Tinedo, Pedro García-López, Marc Sánchez-Artigas, Josep Sampé, Yosef Moatti, Eran Rom, Dalit Naor, Ramon Nou, Toni Cortés, William Oppermann and Pietro Michiardi

IEEE Internet Computing

As the complexity and scale of cloud storage systems grow, software-defined storage (SDS) has become a prime candidate to simplify cloud storage management. In this work, the authors present IOStack: the first SDS architecture for object stores (such as OpenStack Swift). At the control plane, administrators provision SDS services to tenants according to policies expressed via a high-level DSL. At the data plane, IOStack helps build a variety of filters, ranging from arbitrary computations on objects to data management mechanisms. Experiments illustrate that IOStack enables easy and effective policy-based provisioning, which can significantly improve the operation of a multitenant object store.



Created: 2016-06-13
Size: 1.47 MB
Downloads: 2,169.00
Dissecting UbuntuOne: Autopsy of a Global-scale Personal Cloud Back-end

Raúl Gracia-Tinedo, Yongchao Tian, Josep Sampé, Hamza Harkous, John Lenton, Pedro García-López, Marc Sánchez-Artigas and Marko Vukolic

ACM Internet Measurement Conference (IMC '15)

Personal Cloud services, such as Dropbox or Box, have been widely adopted by users. Unfortunately, very little is known about the internal operation and general characteristics of Personal Clouds since they are proprietary services.

In this paper, we focus on understanding the nature of Personal Clouds by presenting the internal structure and a measurement study of UbuntuOne (U1). We first detail the U1 architecture, core components involved in the U1 metadata service hosted in the datacenter of Canonical, as well as the interactions of U1 with Amazon S3 to outsource data storage. To our knowledge, this is the first research work to describe the internals of a large-scale Personal Cloud.

Second, by means of tracing the U1 servers, we provide an extensive analysis of its back-end activity for one month. Our analysis includes the study of the storage workload, the user behavior and the performance of the U1 metadata store. Moreover, based on our analysis, we suggest improvements to U1 that can also benefit similar Personal Cloud systems.

Finally, we contribute our dataset to the community, which is the first to contain the back-end activity of a large-scale Personal Cloud. We believe that our dataset provides unique opportunities for extending research in the field.



Created: 2016-06-13
Size: 2.61 MB
Downloads: 1,760.00
Experimental Performance Evaluation of Cloud-Based Analytics-as-a-Service

Francesco Pace, Marco Milanesio, Daniele Venzano, Damiano Carra, Pietro Michiardi

IEEE International Conference on Cloud Computing (IEEE CLOUD'16)

An increasing number of Analytics-as-a-Service solutions have recently emerged in the landscape of cloud-based services. These services allow flexible composition of compute and storage components that create powerful data ingestion and processing pipelines. This work is a first attempt at an experimental evaluation of analytic application performance executed using a wide range of storage service configurations. We present an intuitive notion of data locality that we use as a proxy to rank different service compositions in terms of expected performance. Through an empirical analysis, we dissect the performance achieved by analytic workloads and unveil problems due to the impedance mismatch that arises in some configurations. Our work paves the way to a better understanding of modern cloud-based analytic services and their performance, both for their end-users and their providers.

 



Created: 2016-06-13
Size: 532.01 KB
Downloads: 1,752.00
Crystal: Software-Defined Storage for Multi-tenant Object Stores

Raúl Gracia-Tinedo, Josep Sampé, Edgar Zamora, Marc Sánchez-Artigas, Pedro García-López, Yosef Moatti and Eran Rom

USENIX Conference on File and Storage Technologies (FAST '17)

Object stores are becoming pervasive due to their scalability and simplicity. Their broad adoption, however, contrasts with their rigidity for handling heterogeneous workloads and applications with evolving requirements, which prevents the adaptation of the system to such varied needs. In this work, we present Crystal, the first Software-Defined Storage (SDS) architecture whose core objective is to efficiently support multi-tenancy in object stores. Crystal adds a filtering abstraction at the data plane and exposes it to the control plane to enable high-level policies at the tenant, container and object granularities. Crystal translates these policies into a set of distributed controllers that can orchestrate filters at the data plane based on real-time workload information.



Created: 2017-01-30
Size: 783.25 KB
Downloads: 1,518.00
Improving the QoE in Personal Clouds with Cross-Swarm Bundling

Rahma Chaabouni, Marc Sánchez-Artigas, Ala Chaabouni and Pedro García-López

IEEE 41st Conference on Local Computer Networks (IEEE LCN'16)

Personal cloud storage systems, like Dropbox, are revolutionizing the way people think about and access their files. As the prevailing model, these systems use unicast to push file changes to each of the “unsynced” devices. As a result, they transmit the same information multiple times, once per unsynced device. This puts an unnecessary strain on outgoing bandwidth at the datacenters. One way to address this is to leverage P2P-like content distribution to benefit from user resources at the edges of the Internet.

Although protocols like BitTorrent have proven to be effective in this scenario, we go a step further in this work and propose cross-swarm bundling as a mechanism for file distribution. One key contribution of this work is that, instead of using bundling as a means to extend the lifetime of swarms, we show that it can be useful to improve the Quality of Experience (QoE). We validate our proposal using a trace of Ubuntu One, a real personal cloud system, obtaining significant improvements in the QoE levels.



Created: 2017-01-30
Size: 483.28 KB
Downloads: 1,504.00
NG-DBSCAN: Scalable Density-Based Clustering for Arbitrary Data

Alessandro Lulli, Matteo Dell’Amico, Pietro Michiardi, Laura Ricci

Proceedings of the VLDB Endowment (VLDB '16)

We present NG-DBSCAN, an approximate density-based clustering algorithm that operates on arbitrary data and any symmetric distance measure. The distributed design of our algorithm makes it scalable to very large datasets; its approximate nature makes it fast, yet capable of producing high quality clustering results. We provide a detailed overview of the steps of NG-DBSCAN, together with their analysis. Our results, obtained through an extensive experimental campaign with real and synthetic data, substantiate our claims about NG-DBSCAN’s performance and scalability.



Created: 2017-01-30
Size: 1.08 MB
Downloads: 1,810.00
Giving Wings to Your Data: A First Experience of Personal Cloud Interoperability

Raúl Gracia-Tinedo, Cristian Cotes, Edgar Zamora-Gómez, Genís Ortiz, Adrián Moreno-Martínez, Marc Sánchez-Artigas, Pedro García-López, Raquel Sánchez, Alberto Gómez and Anastasio Illiana

Elsevier Future Generation Computer Systems (2017)

Personal Clouds are becoming increasingly popular storage services for end-users and organizations. However, the competition among Personal Clouds, their proprietary nature and the heterogeneity of synchronization protocols have led to a complete lack of interoperability among them. Regrettably, this situation prevents users from sharing data transparently across multiple providers. Even worse, the lack of interoperability carries serious risks, such as vendor lock-in, in which users get trapped in a single provider due to the cost of switching to another one.

In this work, we contribute DataWings: the first interoperability protocol for Personal Clouds. DataWings consists of an authentication management protocol and a storage API for file storage, synchronization and sharing that adhere to the current authentication (OAuth) and REST standards, respectively. Moreover, we demonstrate the feasibility of DataWings by implementing the protocol in various providers (NEC, StackSync, eyeOS) and performing a real deployment evaluated with real trace replays of production systems (UbuntuOne, NEC). To our knowledge, this is the first real-world experience of Personal Cloud interoperability. Our experiments provide new insights on the performance implications that different types of user activity and the underlying sharing network topology have on the implementation of our protocol. We conclude that DataWings is flexible enough to leverage interoperability for heterogeneous Personal Clouds, opening the door for a broader adoption by other vendors.



Created: 2017-01-30
Size: not listed
Downloads: 0.00
Oblivious RAM as a Substrate for Cloud Storage -- The Leakage Challenge Ahead

Marc Sánchez-Artigas

ACM Cloud Computing Security Workshop (ACM CCSW '16)

Oblivious RAM (ORAM) is a well-established technology to hide data access patterns from an untrusted storage system. Although research in ORAM has been spurred in the last few years with the irruption of cloud computing, it is still unclear whether ORAM is ready for the cloud. As we demonstrate in this short paper, there are still some important hurdles to be overcome. One of those is the standard block-based ORAM interface, which can become a timing side-channel when used as a substrate to implement higher level abstractions such as filesystems, personal storage services, etc., typically found in the cloud. We analyze this form of leakage and discuss some possible solutions to this problem, concluding that thwarting it in an efficient manner calls for further research.



Created: 2017-01-30
Size: 521.43 KB
Downloads: 1,528.00
Flexible Scheduling of Distributed Analytic Applications

Francesco Pace, Daniele Venzano, Damiano Carra and Pietro Michiardi

IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID'17)

This work addresses the problem of scheduling user-defined analytic applications, which we define as high-level compositions of frameworks, their components, and the logic necessary to carry out work. The key idea in our application definition is to distinguish classes of components, including rigid and elastic types: the first being required for an application to make progress, the latter contributing to reduced execution times. We show that the problem of scheduling such applications poses new challenges, which existing approaches address inefficiently. Thus, we present the design and evaluation of a novel, flexible heuristic to schedule analytic applications that aims at high system responsiveness by allocating resources efficiently. Our algorithm is evaluated using trace-driven simulations with large-scale real system traces: our flexible scheduler outperforms a baseline approach across a variety of metrics, including application turnaround times and resource allocation efficiency. We also present the design and evaluation of a full-fledged system, which we have called Zoe, that incorporates the ideas presented in this paper, and report concrete improvements in terms of efficiency and performance with respect to prior generations of our system.



Created: 2018-01-30
Size: 413.48 KB
Downloads: 1,173.00
Improving OpenStack Swift interaction with the I/O Stack to enable Software Defined Storage

Ramon Nou, Alberto Miranda, Marc Siquier and Toni Cortes

The 7th IEEE International Symposium on Cloud and Service Computing (SC2 '17)

This paper analyses how OpenStack Swift, a distributed object storage service for a globally used middleware, interacts with the I/O subsystem through the Operating System. This interaction, which seems organised and clean on the middleware side, becomes disordered on the device side when using mechanical disk drives, due to the way threads are used internally to request data. We show that by modifying only the Swift threading model we achieve an 18% mean improvement in performance with objects larger than 512 KiB and obtain similar performance with smaller objects. Compared to the original scenario, the performance in both scenarios is obtained in a fair way: the bandwidth is shared equally between concurrently accessed objects. Moreover, this threading model allows us to apply techniques for Software Defined Storage (SDS). We show an implementation of a Bandwidth Differentiation technique that can control each data stream and that guarantees a high utilization of the device.



Created: 2018-02-15
Size: not listed
Downloads: 0.00
Stocator: Providing High Performance and Fault Tolerance for Apache Spark over Object Storage

Gil Vernik, Michael Factor, Elliot K. Kolodner, Effi Ofer, Pietro Michiardi and Francesco Pace

IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID'18)

We present Stocator, a high performance object store connector for Apache Spark, that takes advantage of object store semantics. Previous connectors have assumed file system semantics, in particular, achieving fault tolerance and allowing speculative execution by creating temporary files to avoid interference between worker threads executing the same task and then renaming these files. Rename is not a native object store operation; not only is it not atomic, but it is implemented using a costly copy operation and a delete. Instead our connector leverages the inherent atomicity of object creation, and by avoiding the rename paradigm it greatly decreases the number of operations on the object store as well as enabling a much simpler approach to dealing with the eventually consistent semantics typical of object stores. We have implemented Stocator and shared it in open source. Performance testing shows that it is as much as 18 times faster for write intensive workloads and performs as much as 30 times fewer operations on the object store than the legacy Hadoop connectors, reducing costs both for the client and the object storage service provider.
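The cost argument above (rename emulated as a copy plus a delete, versus creating each object once under its final name) can be sketched with a toy in-memory store that counts operations. This is an illustrative model only, not Stocator's or Hadoop's API: the class, the function names and the object paths are hypothetical, and real connectors issue additional requests (such as listings) that are ignored here.

```python
class CountingObjectStore:
    """Toy in-memory object store that counts every operation it serves."""

    def __init__(self):
        self.objects = {}
        self.ops = 0

    def put(self, name, data):
        self.ops += 1
        self.objects[name] = data

    def copy(self, src, dst):
        self.ops += 1
        self.objects[dst] = self.objects[src]

    def delete(self, name):
        self.ops += 1
        del self.objects[name]

    def rename(self, src, dst):
        # Rename is not a native operation: it is emulated as copy + delete.
        self.copy(src, dst)
        self.delete(src)


def commit_with_rename(store, parts):
    """File-system style: write a temporary object, then rename it into place."""
    for i, data in enumerate(parts):
        tmp_name = f"out/_temporary/part-{i}"
        store.put(tmp_name, data)
        store.rename(tmp_name, f"out/part-{i}")


def commit_direct(store, parts):
    """Object-store style: create each object once, under its final name."""
    for i, data in enumerate(parts):
        store.put(f"out/part-{i}", data)


parts = [b"a", b"b", b"c", b"d"]
legacy, direct = CountingObjectStore(), CountingObjectStore()
commit_with_rename(legacy, parts)
commit_direct(direct, parts)
print("rename-based ops:", legacy.ops)
print("direct ops:", direct.ops)
```

In this simplified model the rename-based path costs three operations per output part and the direct path costs one, while both leave the same final objects. The ratios reported in the abstract come from the authors' measurements on real workloads, not from a model like this one.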



Created: 2018-02-15
Size: not listed
Downloads: 0.00
Too Big to Eat: Boosting Analytics Data Ingestion from Object Stores with Scoop

Yosef Moatti, Eran Rom, Raúl Gracia Tinedo, Dalit Naor, Doron Chen, Josep Sampé, Marc Sánchez Artigas, Pedro García López, Filip Gluszak, Eric Deschdt, Francesco Pace, Daniele Venzano and Pietro Michiardi

IEEE International Conference on Data Engineering (ICDE '17)

Extracting value from data stored in object stores, such as OpenStack Swift and Amazon S3, can be problematic in common scenarios where analytics frameworks and object stores run in physically disaggregated clusters. One of the main problems is that analytics frameworks must ingest large amounts of data from the object store prior to the actual computation; this incurs a significant resource and performance overhead. To overcome this problem, we present Scoop. Scoop enables analytics frameworks to benefit from the computational resources of object stores to optimize the execution of analytics jobs. Scoop achieves this by enabling the addition of ETL-type actions to the data upload path and by offloading querying functions to the object store through a rich and extensible active object storage layer. As a proof-of-concept, Scoop enables Apache Spark SQL selections and projections to be executed close to the data in OpenStack Swift for accelerating analytics workloads of a smart energy grid company (GridPocket). Our experiments in a 63-machine cluster with real IoT data and SQL queries from GridPocket show that Scoop exhibits query execution times up to 30x faster than the traditional “ingest-then-compute” approach.



Created: 2018-02-15
Size: not listed
Downloads: 0.00